[GitHub] spark issue #14539: [SPARK-16947][SQL] Improve type coercion for inline tabl...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14539
  
**[Test build #63442 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63442/consoleFull)**
 for PR 14539 at commit 
[`daa01a2`](https://github.com/apache/spark/commit/daa01a2753d4b4af4623a3c50961b651ed96cd4c).





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-08-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r74065336
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/UnsafeArraySuite.scala
 ---
@@ -18,27 +18,131 @@
 package org.apache.spark.sql.catalyst.util
 
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
 import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
+import org.apache.spark.unsafe.Platform
 
 class UnsafeArraySuite extends SparkFunSuite {
 
-  test("from primitive int array") {
-val array = Array(1, 10, 100)
-val unsafe = UnsafeArrayData.fromPrimitiveArray(array)
-assert(unsafe.numElements == 3)
-assert(unsafe.getSizeInBytes == 4 + 4 * 3 + 4 * 3)
-assert(unsafe.getInt(0) == 1)
-assert(unsafe.getInt(1) == 10)
-assert(unsafe.getInt(2) == 100)
+  val booleanArray = Array(false, true)
+  val shortArray = Array(1.toShort, 10.toShort, 100.toShort)
+  val intArray = Array(1, 10, 100)
+  val longArray = Array(1.toLong, 10.toLong, 100.toLong)
+  val floatArray = Array(1.1.toFloat, 2.2.toFloat, 3.3.toFloat)
+  val doubleArray = Array(1.1, 2.2, 3.3)
+  val stringArray = Array("1", "10", "100")
--- End diff --

we can use `RowEncoder` for this case:
```
val schema = new StructType().add("array", ArrayType(DecimalType(20, 10)))
val encoder = RowEncoder(schema).resolveAndBind()
val externalRow = Row(new GenericArrayData(Array(Decimal("23213.131231"))))
val unsafeDecimalArray = encoder.toRow(externalRow)
```
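
For anyone trying that idea outside the suite, a self-contained version might look roughly like the following (a sketch against the Spark 2.x catalyst APIs; the schema and decimal value are just illustrative):
```
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, DecimalType, StructType}

// Encode an external Row holding a decimal array into Spark's internal
// representation, then read the array column back for assertions.
val schema = new StructType().add("array", ArrayType(DecimalType(20, 10)))
val encoder = RowEncoder(schema).resolveAndBind()
val externalRow = Row(Seq(new java.math.BigDecimal("23213.131231")))
val internalRow = encoder.toRow(externalRow)
assert(internalRow.getArray(0).numElements() == 1)
```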





[GitHub] spark pull request #14539: [SPARK-16947][SQL] Improve type coercion for inli...

2016-08-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14539#discussion_r74064642
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -656,40 +656,37 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] 
with Logging {
* Create an inline table (a virtual table in Hive parlance).
*/
   override def visitInlineTable(ctx: InlineTableContext): LogicalPlan = 
withOrigin(ctx) {
-// Get the backing expressions.
-val expressions = ctx.expression.asScala.map { eCtx =>
-  val e = expression(eCtx)
-  assert(e.foldable, "All expressions in an inline table must be 
constants.", eCtx)
-  e
-}
-
-// Validate and evaluate the rows.
-val (structType, structConstructor) = expressions.head.dataType match {
-  case st: StructType =>
-(st, (e: Expression) => e)
-  case dt =>
-val st = CreateStruct(Seq(expressions.head)).dataType
-(st, (e: Expression) => CreateStruct(Seq(e)))
-}
-val rows = expressions.map {
-  case expression =>
-val safe = Cast(structConstructor(expression), structType)
-safe.eval().asInstanceOf[InternalRow]
+// Create expressions.
+val rows = ctx.expression.asScala.map { e =>
+  expression(e) match {
+case CreateStruct(children) => children
+case child => Seq(child)
+  }
 }
 
-// Construct attributes.
-val baseAttributes = 
structType.toAttributes.map(_.withNullability(true))
-val attributes = if (ctx.identifierList != null) {
-  val aliases = visitIdentifierList(ctx.identifierList)
-  assert(aliases.size == baseAttributes.size,
-"Number of aliases must match the number of fields in an inline 
table.", ctx)
-  baseAttributes.zip(aliases).map(p => p._1.withName(p._2))
+// Resolve aliases.
+val numExpectedColumns = rows.head.size
+val aliases = if (ctx.identifierList != null) {
+  val names = visitIdentifierList(ctx.identifierList)
+  assert(names.size == numExpectedColumns,
--- End diff --

It uses a parser-only version of `assert` that throws a ParseException: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala#L81

Come to think of it, we might need to rename it, because people expect that 
`assert` calls can be elided. That is for a different PR, though.
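
For readers who do not want to chase the link, the helper behind that call is roughly the following (a sketch of the ParserUtils-style assert; the name `validate` is only an illustration of the rename being floated here):
```
import org.antlr.v4.runtime.ParserRuleContext
import org.apache.spark.sql.catalyst.parser.ParseException

// Parser-side assert: unlike Predef.assert it can never be elided by the
// compiler, and it points the error at the offending SQL fragment.
def validate(f: => Boolean, message: String, ctx: ParserRuleContext): Unit = {
  if (!f) {
    throw new ParseException(message, ctx)
  }
}
```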





[GitHub] spark pull request #14539: [SPARK-16947][SQL] Improve type coercion for inli...

2016-08-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14539#discussion_r74066511
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -769,3 +769,41 @@ case object OneRowRelation extends LeafNode {
*/
   override lazy val statistics: Statistics = Statistics(sizeInBytes = 1)
 }
+
+/**
+ * An inline table that holds a number of foldable expressions, which can 
be materialized into
+ * rows. This is semantically the same as a Union of one row relations.
+ */
+case class InlineTable(rows: Seq[Seq[NamedExpression]]) extends LeafNode {
+  lazy val expressionsResolved: Boolean = rows.forall(_.forall(_.resolved))
--- End diff --

do we really need this? `QueryPlan.expressions` already handles seq of seq 
of expressions.
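
To make the suggestion concrete, here is a small standalone toy of what `QueryPlan.expressions` does with nested sequences (a simplified re-implementation for illustration only, not Spark's actual code):
```
// A plan's constructor arguments are walked and any nested Seq is flattened,
// so a Seq[Seq[NamedExpression]] field is already reachable via `expressions`.
trait Expr { def resolved: Boolean }
case class Lit(resolved: Boolean = true) extends Expr

def flattenExpressions(args: Seq[Any]): Seq[Expr] = args.flatMap {
  case e: Expr   => Seq(e)
  case s: Seq[_] => flattenExpressions(s)
  case _         => Nil
}

val rows: Seq[Seq[Expr]] = Seq(Seq(Lit(), Lit()), Seq(Lit()))
val expressionsResolved = flattenExpressions(Seq(rows)).forall(_.resolved) // true
```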





[GitHub] spark pull request #14539: [SPARK-16947][SQL] Improve type coercion for inli...

2016-08-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14539#discussion_r74066649
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -769,3 +769,41 @@ case object OneRowRelation extends LeafNode {
*/
   override lazy val statistics: Statistics = Statistics(sizeInBytes = 1)
 }
+
+/**
+ * An inline table that holds a number of foldable expressions, which can 
be materialized into
+ * rows. This is semantically the same as a Union of one row relations.
+ */
+case class InlineTable(rows: Seq[Seq[NamedExpression]]) extends LeafNode {
--- End diff --

should we assert `rows.nonEmpty`?
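
A minimal way to express that guard, should the PR adopt it, might be the following (illustrative only; output handling is elided and the name is hypothetical):
```
import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Fail fast at construction time rather than letting an empty inline table
// propagate through analysis (sketch, not the PR's actual definition).
case class InlineTableSketch(rows: Seq[Seq[NamedExpression]]) extends LeafNode {
  require(rows.nonEmpty, "An inline table must contain at least one row.")
  override def output: Seq[Attribute] = Nil // elided in this sketch
}
```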





[GitHub] spark issue #14528: [SPARK-16940][SQL] `checkAnswer` should raise `TestFaile...

2016-08-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14528
  
Thank you, @srowen .





[GitHub] spark issue #14539: [SPARK-16947][SQL] Improve type coercion for inline tabl...

2016-08-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14539
  
@hvanhovell, I may be missing something: why do we create this new `InlineTable` 
instead of using `Union`? I think we can create a special `OneRowRelation` (e.g. 
`UnfoldableOneRowRelation`) in test scope and use it in `ExpressionEvalHelper`.
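
For reference, the Union-based shape being described would look roughly like this, using the catalyst classes as they appear in this diff (a sketch, not what the PR actually builds):
```
import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, OneRowRelation, Project, Union}

// VALUES (1, 'a'), (2, 'b'): every row becomes a Project over OneRowRelation,
// and the rows are then glued together with a Union.
val rows = Seq(
  Seq(Literal(1), Literal("a")),
  Seq(Literal(2), Literal("b")))

val asUnion: LogicalPlan = Union(rows.map { row =>
  Project(row.zipWithIndex.map { case (e, i) => Alias(e, s"col${i + 1}")() }, OneRowRelation)
})
```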





[GitHub] spark issue #14546: [SPARK-16955][SQL] Using ordinals in ORDER BY and GROUP ...

2016-08-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14546
  
Hi, @yhuai .
Could you review this PR?





[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11157
  
**[Test build #63443 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63443/consoleFull)**
 for PR 11157 at commit 
[`6f7f2d3`](https://github.com/apache/spark/commit/6f7f2d35d6ca8b5a1e3ab71081b4a44d59c54630).





[GitHub] spark issue #14539: [SPARK-16947][SQL] Improve type coercion for inline tabl...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14539
  
**[Test build #63442 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63442/consoleFull)**
 for PR 14539 at commit 
[`daa01a2`](https://github.com/apache/spark/commit/daa01a2753d4b4af4623a3c50961b651ed96cd4c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class InlineTable(rows: Seq[Seq[NamedExpression]]) extends 
LeafNode `





[GitHub] spark issue #14539: [SPARK-16947][SQL] Improve type coercion for inline tabl...

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14539
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63442/
Test FAILed.





[GitHub] spark issue #14539: [SPARK-16947][SQL] Improve type coercion for inline tabl...

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14539
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #14422: Add rand(numRows: Int, numCols: Int) functions

2016-08-09 Thread xubo245
Github user xubo245 closed the pull request at:

https://github.com/apache/spark/pull/14422





[GitHub] spark pull request #14422: Add rand(numRows: Int, numCols: Int) functions

2016-08-09 Thread xubo245
GitHub user xubo245 reopened a pull request:

https://github.com/apache/spark/pull/14422

Add rand(numRows: Int, numCols: Int) functions

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)


## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)


add rand(numRows: Int, numCols: Int) functions to the DenseMatrix object, like 
breeze.linalg.DenseMatrix.rand()

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xubo245/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14422.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14422


commit a7a1261f52112a3bca375dd0bed1c1bc0a2e0ed8
Author: 徐波 <601450...@qq.com>
Date:   2016-07-30T15:43:36Z

Add rand(numRows: Int, numCols: Int) functions

add rand(numRows: Int, numCols: Int) functions to the DenseMatrix object, like 
breeze.linalg.DenseMatrix.rand()

commit 054b70ccce73c02cce04caf9f7958cfc555df829
Author: 徐波 <601450...@qq.com>
Date:   2016-07-30T16:36:30Z

fix RNG

fix RNG: this makes a new RNG for all elements







[GitHub] spark issue #14422: Add rand(numRows: Int, numCols: Int) functions

2016-08-09 Thread xubo245
Github user xubo245 commented on the issue:

https://github.com/apache/spark/pull/14422
  
ok






[GitHub] spark pull request #14422: Add rand(numRows: Int, numCols: Int) functions

2016-08-09 Thread xubo245
Github user xubo245 closed the pull request at:

https://github.com/apache/spark/pull/14422





[GitHub] spark pull request #14412: [SPARK-15355] [CORE] [WIP] Proactive block replic...

2016-08-09 Thread shubhamchopra
Github user shubhamchopra commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r74088323
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala 
---
@@ -37,10 +37,11 @@ import org.apache.spark.util.Utils
 class BlockManagerId private (
 private var executorId_ : String,
 private var host_ : String,
-private var port_ : Int)
+private var port_ : Int,
+private var topologyInfo_ : Option[String])
--- End diff --

Added documentation about the parameter in the constructor.





[GitHub] spark pull request #14566: Make logDir easily copy/paste-able

2016-08-09 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/14566

Make logDir easily copy/paste-able

In many terminals double-clicking and dragging also includes the trailing 
period.  Simply remove this to make the value more easily copy/pasteable.

Example value:
`hdfs://mybox-123.net.example.com:8020/spark-events.`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14566.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14566


commit ed3292dd1a67f1d67d8e6a9ae1d02481f39fe4fd
Author: Andrew Ash 
Date:   2016-08-09T16:11:49Z

Make logDir easily copy/paste-able

In many terminals double-clicking and dragging also includes the trailing 
period.  Simply remove this to make the value more easily copy/pasteable.

Example value:
`hdfs://mybox-123.net.example.com:8020/spark-events.`







[GitHub] spark issue #14566: Make logDir easily copy/paste-able

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14566
  
**[Test build #63444 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63444/consoleFull)**
 for PR 14566 at commit 
[`ed3292d`](https://github.com/apache/spark/commit/ed3292dd1a67f1d67d8e6a9ae1d02481f39fe4fd).





[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...

2016-08-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14537#discussion_r74092457
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -287,14 +287,14 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   new Path(metastoreRelation.catalogTable.storage.locationUri.get),
   partitionSpec)
 
+val schema =
--- End diff --

Thanks for refactoring this.

I think it makes more sense if

defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())

is called `inferredSchema` and the value of the `if 
(fileType.equals("parquet"))` expression is called `schema`.





[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...

2016-08-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14537#discussion_r74092780
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -287,14 +287,14 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   new Path(metastoreRelation.catalogTable.storage.locationUri.get),
   partitionSpec)
 
+val schema =
+  defaultSource.inferSchema(sparkSession, options, 
fileCatalog.allFiles())
 val inferredSchema = if (fileType.equals("parquet")) {
--- End diff --

I just noticed the boolean expression should be `fileType == "parquet"` to 
make it idiomatic Scala. Can you make that change?
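
For anyone wondering why `==` is preferred here: in Scala, `==` on references delegates to `equals` but is also null-safe, e.g.:
```
val fileType: String = "parquet"
fileType == "parquet"         // true: == calls equals under the hood
(null: String) == "parquet"   // false, and no NullPointerException
// fileType.equals("parquet") behaves the same here, but would throw an NPE
// if fileType happened to be null.
```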





[GitHub] spark pull request #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/14500#discussion_r74094235
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -425,6 +430,111 @@ case class AlterTableDropPartitionCommand(
 
 }
 
+/**
+ * Recover Partitions in ALTER TABLE: recover all the partition in the 
directory of a table and
+ * update the catalog.
+ *
+ * The syntax of this command is:
+ * {{{
+ *   ALTER TABLE table RECOVER PARTITIONS;
+ *   MSCK REPAIR TABLE table;
+ * }}}
+ */
+case class AlterTableRecoverPartitionsCommand(
+tableName: TableIdentifier,
+cmd: String = "ALTER TABLE RECOVER PARTITIONS") extends 
RunnableCommand {
+  override def run(spark: SparkSession): Seq[Row] = {
+val catalog = spark.sessionState.catalog
+if (!catalog.tableExists(tableName)) {
+  throw new AnalysisException(s"Table $tableName in $cmd does not 
exist.")
+}
+val table = catalog.getTableMetadata(tableName)
+if (catalog.isTemporaryTable(tableName)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on temporary tables: $tableName")
+}
+if (DDLUtils.isDatasourceTable(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on datasource tables: $tableName")
+}
+if (table.tableType != CatalogTableType.EXTERNAL) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on external tables: 
$tableName")
+}
+if (!DDLUtils.isTablePartitioned(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on partitioned tables: 
$tableName")
+}
+if (table.storage.locationUri.isEmpty) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on table with location 
provided: $tableName")
+}
+
+val root = new Path(table.storage.locationUri.get)
+val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+// Dummy jobconf to get to the pathFilter defined in configuration
+// It's very expensive to create a 
JobConf(ClassUtil.findContainingJar() is slow)
+val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration, 
this.getClass)
+val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
+val partitionSpecsAndLocs = scanPartitions(
+  spark, fs, pathFilter, root, Map(), 
table.partitionColumnNames.map(_.toLowerCase))
+val parts = partitionSpecsAndLocs.map { case (spec, location) =>
+  // inherit table storage format (possibly except for location)
+  CatalogTablePartition(spec, table.storage.copy(locationUri = 
Some(location.toUri.toString)))
+}
+spark.sessionState.catalog.createPartitions(tableName,
+  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
+Seq.empty[Row]
+  }
+
+  @transient private lazy val evalTaskSupport = new 
ForkJoinTaskSupport(new ForkJoinPool(8))
+
+  private def scanPartitions(
+  spark: SparkSession,
+  fs: FileSystem,
+  filter: PathFilter,
+  path: Path,
+  spec: TablePartitionSpec,
+  partitionNames: Seq[String]): GenSeq[(TablePartitionSpec, Path)] = {
+if (partitionNames.length == 0) {
+  return Seq(spec -> path)
+}
+
+val statuses = fs.listStatus(path)
+val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", 
"10").toInt
+val statusPar: GenSeq[FileStatus] =
+  if (partitionNames.length > 1 && statuses.length > threshold || 
partitionNames.length > 2) {
+val parArray = statuses.par
+parArray.tasksupport = evalTaskSupport
+parArray
+  } else {
+statuses
+  }
+statusPar.flatMap { st =>
+  val name = st.getPath.getName
+  if (st.isDirectory && name.contains("=")) {
+val ps = name.split("=", 2)
+val columnName = 
PartitioningUtils.unescapePathName(ps(0)).toLowerCase
+// TODO: Validate the value
+val value = PartitioningUtils.unescapePathName(ps(1))
+// comparing with case-insensitive, but preserve the case
+if (columnName == partitionNames(0)) {
+  scanPartitions(
+spark, fs, filter, st.getPath, spec ++ Map(columnName -> 
value), partitionNames.drop(1))
+} else {
+  logWarning(s"expect partition column ${partitionNames(0)}, but 
got ${ps(0)}, ignore it")
+  Seq()
--- End diff --

Hive only throws an exception when there are disallowed characters in the 
value, not in other cases. I'd like to avoid adding any configs if there is no 
serious problem here.



[GitHub] spark pull request #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/14500#discussion_r74094542
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -425,6 +430,110 @@ case class AlterTableDropPartitionCommand(
 
 }
 
+/**
+ * Recover Partitions in ALTER TABLE: recover all the partition in the 
directory of a table and
+ * update the catalog.
+ *
+ * The syntax of this command is:
+ * {{{
+ *   ALTER TABLE table RECOVER PARTITIONS;
+ *   MSCK REPAIR TABLE table;
+ * }}}
+ */
+case class AlterTableRecoverPartitionsCommand(
+tableName: TableIdentifier,
+cmd: String = "ALTER TABLE RECOVER PARTITIONS") extends 
RunnableCommand {
+  override def run(spark: SparkSession): Seq[Row] = {
+val catalog = spark.sessionState.catalog
+if (!catalog.tableExists(tableName)) {
+  throw new AnalysisException(s"Table $tableName in $cmd does not 
exist.")
+}
+val table = catalog.getTableMetadata(tableName)
+if (catalog.isTemporaryTable(tableName)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on temporary tables: $tableName")
+}
+if (DDLUtils.isDatasourceTable(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on datasource tables: $tableName")
+}
+if (table.tableType != CatalogTableType.EXTERNAL) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on external tables: 
$tableName")
+}
+if (!DDLUtils.isTablePartitioned(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on partitioned tables: 
$tableName")
+}
+if (table.storage.locationUri.isEmpty) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on table with location 
provided: $tableName")
+}
+
+val root = new Path(table.storage.locationUri.get)
+val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+// Dummy jobconf to get to the pathFilter defined in configuration
+// It's very expensive to create a 
JobConf(ClassUtil.findContainingJar() is slow)
+val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration, 
this.getClass)
+val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
+val partitionSpecsAndLocs = scanPartitions(
+  spark, fs, pathFilter, root, Map(), 
table.partitionColumnNames.map(_.toLowerCase))
+val parts = partitionSpecsAndLocs.map { case (spec, location) =>
+  // inherit table storage format (possibly except for location)
+  CatalogTablePartition(spec, table.storage.copy(locationUri = 
Some(location.toUri.toString)))
+}
+spark.sessionState.catalog.createPartitions(tableName,
+  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
+Seq.empty[Row]
+  }
+
+  @transient private lazy val evalTaskSupport = new 
ForkJoinTaskSupport(new ForkJoinPool(8))
+
+  private def scanPartitions(
+  spark: SparkSession,
+  fs: FileSystem,
+  filter: PathFilter,
+  path: Path,
+  spec: TablePartitionSpec,
+  partitionNames: Seq[String]): GenSeq[(TablePartitionSpec, Path)] = {
+if (partitionNames.length == 0) {
+  return Seq(spec -> path)
+}
+
+val statuses = fs.listStatus(path)
+val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", 
"10").toInt
+val statusPar: GenSeq[FileStatus] =
+  if (partitionNames.length > 1 && statuses.length > threshold || 
partitionNames.length > 2) {
+val parArray = statuses.par
--- End diff --

This is copied from UnionRDD.





[GitHub] spark issue #14539: [SPARK-16947][SQL] Improve type coercion for inline tabl...

2016-08-09 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14539
  
@cloud-fan I had an offline discussion with @rxin about this. His main 
point was that a larger inline table would create an extremely unreadable plan. 
So I came up with this.





[GitHub] spark issue #14540: [SPARK-16950] [PySpark] fromOffsets parameter support in...

2016-08-09 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14540
  
Ah interesting, we might want to report the bug upstream with Py4J - but 
this change looks good to me :) Thanks for getting this working in Python 3 :) 
cc @davies who can maybe take a look as well?





[GitHub] spark pull request #14567: Python import reorg

2016-08-09 Thread Stibbons
GitHub user Stibbons opened a pull request:

https://github.com/apache/spark/pull/14567

Python import reorg

## What changes were proposed in this pull request?

This patch adds a code style validation script following PEP 8 
recommendations.

Features:
- add a .editconfig file (Scala files use 2-space indentation, while Python 
uses 4) for compatible editors (almost every editor has a plugin that supports 
this file)
- use autopep8 to fix basic PEP 8 mistakes
- use isort to automatically split "import" statements and organise them 
into a logically linked order (see the doc 
[here](https://pypi.python.org/pypi/isort#multi-line-output-modes)). The most 
important point is that it splits import statements that import more than one 
object into several lines. This increases the number of lines in the file, 
but it greatly simplifies file maintenance and file merges when needed.
- add a 'validate.sh' script to automate the correction (requires isort and 
autopep8 to be installed)

You can see a similar script in production in the 
[Buildbot](https://github.com/buildbot/buildbot/blob/master/common/validate.sh) 
project.


## How was this patch tested?

Simple tests have been done on my machines (local mode only). There should 
not be any regression or feature change at all with this pull request.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Stibbons/spark python_import_reorg

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14567.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14567


commit 61d6f38470ded699d0a0a751a79f069f1bf66cb1
Author: Gaetan Semet 
Date:   2016-08-09T16:13:55Z

reorg python imports statements

commit 2b719bec3e286e60d5f4f895e9d044c0f6ac0b84
Author: Gaetan Semet 
Date:   2016-08-09T16:13:20Z

autopep8

commit 97540d727679cc72ce13368f56c69ec412b6f162
Author: Gaetan Semet 
Date:   2016-08-09T16:14:10Z

import isort

commit 85a56ec22f4d2829242b0f1a9491bae7d117c18e
Author: Gaetan Semet 
Date:   2016-08-09T16:35:39Z

validate script

commit 84320dd6ba6c7d6c367d422423d2f5c1a95c6f82
Author: Gaetan Semet 
Date:   2016-08-09T16:36:00Z

isort execution







[GitHub] spark issue #14540: [SPARK-16950] [PySpark] fromOffsets parameter support in...

2016-08-09 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/14540
  
LGTM, merging this into master, thanks!





[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

2016-08-09 Thread Stibbons
Github user Stibbons commented on the issue:

https://github.com/apache/spark/pull/14180
  
Opened #14567 with PEP 8 fixes, import reorganisation, and editconfig.





[GitHub] spark pull request #14540: [SPARK-16950] [PySpark] fromOffsets parameter sup...

2016-08-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14540





[GitHub] spark issue #14567: Python import reorg

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14567
  
Can one of the admins verify this patch?





[GitHub] spark issue #14567: Python import reorg

2016-08-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14567
  
Although there has generally been some resistance to large style-only 
changes, we do enforce import order in Scala/Java including checks. So it seems 
pretty reasonable to do the same in one big go for Python.





[GitHub] spark pull request #14567: Python import reorg

2016-08-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14567#discussion_r74098782
  
--- Diff: python/pyspark/context.py ---
@@ -22,22 +22,30 @@
 import signal
 import sys
 import threading
-from threading import RLock
 from tempfile import NamedTemporaryFile
+from threading import RLock
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
 from pyspark.broadcast import Broadcast
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
-from pyspark.serializers import PickleSerializer, BatchedSerializer, 
UTF8Deserializer, \
-PairDeserializer, AutoBatchedSerializer, NoOpSerializer
-from pyspark.storagelevel import StorageLevel
-from pyspark.rdd import RDD, _load_from_socket, ignore_unicode_prefix
-from pyspark.traceback_utils import CallSite, first_spark_call
+from pyspark.profiler import BasicProfiler
+from pyspark.profiler import ProfilerCollector
+from pyspark.rdd import RDD
+from pyspark.rdd import _load_from_socket
+from pyspark.rdd import ignore_unicode_prefix
+from pyspark.serializers import AutoBatchedSerializer
--- End diff --

Expanding these multiple imports seems counter-productive. We don't do it 
in Scala (and in Java you can only import one thing or everything). Is this 
important/canonical for PEP8?





[GitHub] spark pull request #14539: [SPARK-16947][SQL] Improve type coercion for inli...

2016-08-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14539#discussion_r74098318
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -769,3 +769,41 @@ case object OneRowRelation extends LeafNode {
*/
   override lazy val statistics: Statistics = Statistics(sizeInBytes = 1)
 }
+
+/**
+ * An inline table that holds a number of foldable expressions, which can 
be materialized into
+ * rows. This is semantically the same as a Union of one row relations.
+ */
+case class InlineTable(rows: Seq[Seq[NamedExpression]]) extends LeafNode {
+  lazy val expressionsResolved: Boolean = rows.forall(_.forall(_.resolved))
--- End diff --

I needed that piece of code in two places (resolve and type coercion), so I 
made it a val. But I can remove this.





[GitHub] spark pull request #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/14500#discussion_r74099592
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -425,6 +430,111 @@ case class AlterTableDropPartitionCommand(
 
 }
 
+/**
+ * Recover Partitions in ALTER TABLE: recover all the partition in the 
directory of a table and
+ * update the catalog.
+ *
+ * The syntax of this command is:
+ * {{{
+ *   ALTER TABLE table RECOVER PARTITIONS;
+ *   MSCK REPAIR TABLE table;
+ * }}}
+ */
+case class AlterTableRecoverPartitionsCommand(
+tableName: TableIdentifier,
+cmd: String = "ALTER TABLE RECOVER PARTITIONS") extends 
RunnableCommand {
+  override def run(spark: SparkSession): Seq[Row] = {
+val catalog = spark.sessionState.catalog
+if (!catalog.tableExists(tableName)) {
+  throw new AnalysisException(s"Table $tableName in $cmd does not 
exist.")
+}
+val table = catalog.getTableMetadata(tableName)
+if (catalog.isTemporaryTable(tableName)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on temporary tables: $tableName")
+}
+if (DDLUtils.isDatasourceTable(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on datasource tables: $tableName")
+}
+if (table.tableType != CatalogTableType.EXTERNAL) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on external tables: 
$tableName")
+}
+if (!DDLUtils.isTablePartitioned(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on partitioned tables: 
$tableName")
+}
+if (table.storage.locationUri.isEmpty) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on table with location 
provided: $tableName")
+}
+
+val root = new Path(table.storage.locationUri.get)
+val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+// Dummy jobconf to get to the pathFilter defined in configuration
+// It's very expensive to create a 
JobConf(ClassUtil.findContainingJar() is slow)
+val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration, 
this.getClass)
+val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
+val partitionSpecsAndLocs = scanPartitions(
+  spark, fs, pathFilter, root, Map(), 
table.partitionColumnNames.map(_.toLowerCase))
+val parts = partitionSpecsAndLocs.map { case (spec, location) =>
+  // inherit table storage format (possibly except for location)
+  CatalogTablePartition(spec, table.storage.copy(locationUri = 
Some(location.toUri.toString)))
+}
+spark.sessionState.catalog.createPartitions(tableName,
+  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
+Seq.empty[Row]
+  }
+
+  @transient private lazy val evalTaskSupport = new 
ForkJoinTaskSupport(new ForkJoinPool(8))
+
+  private def scanPartitions(
+  spark: SparkSession,
+  fs: FileSystem,
+  filter: PathFilter,
+  path: Path,
+  spec: TablePartitionSpec,
+  partitionNames: Seq[String]): GenSeq[(TablePartitionSpec, Path)] = {
+if (partitionNames.length == 0) {
+  return Seq(spec -> path)
+}
+
+val statuses = fs.listStatus(path)
+val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", 
"10").toInt
+val statusPar: GenSeq[FileStatus] =
+  if (partitionNames.length > 1 && statuses.length > threshold || 
partitionNames.length > 2) {
+val parArray = statuses.par
+parArray.tasksupport = evalTaskSupport
+parArray
+  } else {
+statuses
+  }
+statusPar.flatMap { st =>
+  val name = st.getPath.getName
+  if (st.isDirectory && name.contains("=")) {
+val ps = name.split("=", 2)
+val columnName = 
PartitioningUtils.unescapePathName(ps(0)).toLowerCase
+// TODO: Validate the value
+val value = PartitioningUtils.unescapePathName(ps(1))
--- End diff --

If the partitions are generated by Spark, they can be unescaped back 
correctly. For others, there could be compatibility issues. For example, Spark 
does not escape ` ` on Linux, so unescaping `%20` could be wrong (we could 
show a warning?). I think these are not in the scope of this PR.
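
To make the compatibility concern concrete, here is a toy of the %XX unescaping applied to partition directory names (a simplified stand-in for `PartitioningUtils.unescapePathName`, not the real implementation):
```
// Directory names are %XX-decoded when partitions are scanned, so a literal
// "%20" written by some other tool comes back as a space even though Spark
// itself never escaped the space on the way in.
def unescapePathName(path: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < path.length) {
    val c = path.charAt(i)
    if (c == '%' && i + 2 < path.length) {
      sb.append(Integer.parseInt(path.substring(i + 1, i + 3), 16).toChar)
      i += 3
    } else {
      sb.append(c)
      i += 1
    }
  }
  sb.toString
}

unescapePathName("a%20b") // "a b"
```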



[GitHub] spark issue #14566: Make logDir easily copy/paste-able

2016-08-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14566
  
OK





[GitHub] spark pull request #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/14500#discussion_r74100170
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -425,6 +430,110 @@ case class AlterTableDropPartitionCommand(
 
 }
 
+/**
+ * Recover Partitions in ALTER TABLE: recover all the partition in the 
directory of a table and
+ * update the catalog.
+ *
+ * The syntax of this command is:
+ * {{{
+ *   ALTER TABLE table RECOVER PARTITIONS;
+ *   MSCK REPAIR TABLE table;
+ * }}}
+ */
+case class AlterTableRecoverPartitionsCommand(
+tableName: TableIdentifier,
+cmd: String = "ALTER TABLE RECOVER PARTITIONS") extends 
RunnableCommand {
+  override def run(spark: SparkSession): Seq[Row] = {
+val catalog = spark.sessionState.catalog
+if (!catalog.tableExists(tableName)) {
+  throw new AnalysisException(s"Table $tableName in $cmd does not 
exist.")
+}
+val table = catalog.getTableMetadata(tableName)
+if (catalog.isTemporaryTable(tableName)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on temporary tables: $tableName")
+}
+if (DDLUtils.isDatasourceTable(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on datasource tables: $tableName")
+}
+if (table.tableType != CatalogTableType.EXTERNAL) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on external tables: 
$tableName")
+}
+if (!DDLUtils.isTablePartitioned(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on partitioned tables: 
$tableName")
+}
+if (table.storage.locationUri.isEmpty) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on table with location 
provided: $tableName")
+}
+
+val root = new Path(table.storage.locationUri.get)
+val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+// Dummy jobconf to get to the pathFilter defined in configuration
+// It's very expensive to create a 
JobConf(ClassUtil.findContainingJar() is slow)
+val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration, 
this.getClass)
+val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
+val partitionSpecsAndLocs = scanPartitions(
+  spark, fs, pathFilter, root, Map(), 
table.partitionColumnNames.map(_.toLowerCase))
+val parts = partitionSpecsAndLocs.map { case (spec, location) =>
+  // inherit table storage format (possibly except for location)
+  CatalogTablePartition(spec, table.storage.copy(locationUri = 
Some(location.toUri.toString)))
+}
+spark.sessionState.catalog.createPartitions(tableName,
+  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
+Seq.empty[Row]
+  }
+
+  @transient private lazy val evalTaskSupport = new 
ForkJoinTaskSupport(new ForkJoinPool(8))
+
+  private def scanPartitions(
+  spark: SparkSession,
+  fs: FileSystem,
+  filter: PathFilter,
+  path: Path,
+  spec: TablePartitionSpec,
+  partitionNames: Seq[String]): GenSeq[(TablePartitionSpec, Path)] = {
+if (partitionNames.length == 0) {
+  return Seq(spec -> path)
+}
+
+val statuses = fs.listStatus(path)
+val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", 
"10").toInt
+val statusPar: GenSeq[FileStatus] =
+  if (partitionNames.length > 1 && statuses.length > threshold || 
partitionNames.length > 2) {
+val parArray = statuses.par
--- End diff --

I did not figure out how to make that work; at least 
`statuses.par(evalTaskSupport)` does not work.
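
For completeness: in Scala 2.11 `.par` takes no arguments; the pool is attached afterwards by assigning the `tasksupport` field, which is what the diff above does. A minimal standalone sketch:
```
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val statuses = (1 to 100).toArray   // stand-in for fs.listStatus(path)
val parArray = statuses.par         // .par takes no task-support argument
parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
val doubled = parArray.map(_ * 2)   // now runs on the 8-thread pool
```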





[GitHub] spark issue #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/14500
  
Merging into master, thanks!





[GitHub] spark pull request #14567: Python import reorg

2016-08-09 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14567#discussion_r74100949
  
--- Diff: python/pyspark/context.py ---
@@ -22,22 +22,30 @@
 import signal
 import sys
 import threading
-from threading import RLock
 from tempfile import NamedTemporaryFile
+from threading import RLock
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
 from pyspark.broadcast import Broadcast
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
-from pyspark.serializers import PickleSerializer, BatchedSerializer, 
UTF8Deserializer, \
-PairDeserializer, AutoBatchedSerializer, NoOpSerializer
-from pyspark.storagelevel import StorageLevel
-from pyspark.rdd import RDD, _load_from_socket, ignore_unicode_prefix
-from pyspark.traceback_utils import CallSite, first_spark_call
+from pyspark.profiler import BasicProfiler
+from pyspark.profiler import ProfilerCollector
+from pyspark.rdd import RDD
+from pyspark.rdd import _load_from_socket
+from pyspark.rdd import ignore_unicode_prefix
+from pyspark.serializers import AutoBatchedSerializer
--- End diff --

Indeed, this is a deviation from the PEP recommendation. I encourage this 
behaviour since it greatly simplifies file maintenance.

On our Buildbot-based project, we used to have a lot of conflicts involving 
changes to import statements: on two different branches (say, prod and main), we 
would often either add the same import twice (a merge might duplicate the line 
when the two developers placed it in two different spots) or hit a conflict that 
was not easy to resolve (when two developers each import a different object from 
the same module).
Once we set up the "one import per line" rule, we never had a conflict on these 
lines again. This helped us a lot in automating an auto-merger from the release 
branches to the "master" branch.





[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11157
  
**[Test build #63443 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63443/consoleFull)**
 for PR 11157 at commit 
[`6f7f2d3`](https://github.com/apache/spark/commit/6f7f2d35d6ca8b5a1e3ab71081b4a44d59c54630).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class ShuffleIndexInformation `
  * `public class ShuffleIndexRecord `
  * `case class MonotonicallyIncreasingID() extends LeafExpression with 
Nondeterministic `
  * `case class SparkPartitionID() extends LeafExpression with 
Nondeterministic `
  * `case class AggregateExpression(`
  * `case class Least(children: Seq[Expression]) extends Expression `
  * `case class Greatest(children: Seq[Expression]) extends Expression `
  * `case class CurrentDatabase() extends LeafExpression with Unevaluable `
  * `class GenericInternalRow(val values: Array[Any]) extends 
BaseGenericInternalRow `
  * `class AbstractScalaRowIterator[T] extends Iterator[T] `
  * `class TypedSumDouble[IN](val f: IN => Double) extends Aggregator[IN, 
Double, Double] `
  * `class TypedSumLong[IN](val f: IN => Long) extends Aggregator[IN, Long, 
Long] `
  * `class TypedCount[IN](val f: IN => Any) extends Aggregator[IN, Long, 
Long] `
  * `class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, 
(Double, Long), Double] `
  * `case class CreateTable(tableDesc: CatalogTable, mode: SaveMode, query: 
Option[LogicalPlan])`
  * `case class PreprocessDDL(conf: SQLConf) extends Rule[LogicalPlan] `
  * `  implicit class SchemaAttribute(f: StructField) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13701: [SPARK-15639][SPARK-16321][SQL] Push down filter at RowG...

2016-08-09 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/13701
  
LGTM, could you fix the conflict (should be trivial)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11157
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11157
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63443/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14567: Python import reorg

2016-08-09 Thread Stibbons
Github user Stibbons commented on the issue:

https://github.com/apache/spark/pull/14567
  
Rebased, sorry I had to force push this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14551: [SPARK-16961][CORE] Fixed off-by-one error that b...

2016-08-09 Thread nicklavers
Github user nicklavers commented on a diff in the pull request:

https://github.com/apache/spark/pull/14551#discussion_r74101470
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -874,4 +874,38 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
   }
 }
   }
+
+  test("chi square test of randomizeInPlace") {
+// Parameters
+val arraySize = 10
+val numTrials = 1000
+val threshold = 0.05
+val seed = 1L
+
+// results[i][j]: how many times Utils.randomize moves an element from 
position j to position i
+val results: Array[Array[Long]] = Array.ofDim(arraySize, arraySize)
+
+// This must be seeded because even a fair random process will fail 
this test with
+// probability equal to the value of `threshold`, which is 
inconvenient for a unit test.
+val rand = new java.util.Random(seed)
--- End diff --

scala.util.Random is already imported, but Utils.randomizeInPlace requires 
a java.util.Random. I'm not sure what the right approach is here.
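
For reference, a minimal sketch of two options, assuming `Utils.randomizeInPlace` 
keeps its `java.util.Random` parameter; a seeded `scala.util.Random` is just a 
wrapper that exposes its underlying `java.util.Random` as `self`:

```scala
import java.util.{Random => JRandom}
import scala.util.{Random => SRandom}

val seed = 1L

// Option 1: construct the java.util.Random directly, as the test does now.
val jrand: JRandom = new JRandom(seed)

// Option 2: scala.util.Random wraps a java.util.Random and exposes the
// underlying instance as `self`, so it can be handed to an API that
// expects the Java type.
val srand = new SRandom(seed)
val underlying: JRandom = srand.self
```

Either way the sequence of values is driven by the same seed.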


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14567: Python import reorg

2016-08-09 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14567#discussion_r74101512
  
--- Diff: python/run-tests.py ---
@@ -37,11 +45,6 @@
 sys.path.append(os.path.join(os.path.dirname(os.path.realpath(__file__)), 
"../dev/"))
 
 
-from sparktestsupport import SPARK_HOME  # noqa (suppress pep8 warnings)
-from sparktestsupport.shellutils import which, subprocess_check_output  # 
noqa
-from sparktestsupport.modules import all_modules  # noqa
--- End diff --

is it a problem if these statements are placed higher in the file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14539: [SPARK-16947][SQL] Improve type coercion for inli...

2016-08-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14539#discussion_r74100376
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -769,3 +769,41 @@ case object OneRowRelation extends LeafNode {
*/
   override lazy val statistics: Statistics = Statistics(sizeInBytes = 1)
 }
+
+/**
+ * An inline table that holds a number of foldable expressions, which can 
be materialized into
+ * rows. This is semantically the same as a Union of one row relations.
+ */
+case class InlineTable(rows: Seq[Seq[NamedExpression]]) extends LeafNode {
--- End diff --

Yeah, I had these checks. The thing is that none of the LogicalPlans have 
such logic; it has all been centralized in CheckAnalysis. So I added it there.

It might not be a bad plan to move this functionality into the separate 
plans in the longer run.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14567: Python import reorg

2016-08-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14567#discussion_r74101927
  
--- Diff: python/pyspark/context.py ---
@@ -22,22 +22,30 @@
 import signal
 import sys
 import threading
-from threading import RLock
 from tempfile import NamedTemporaryFile
+from threading import RLock
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
 from pyspark.broadcast import Broadcast
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
-from pyspark.serializers import PickleSerializer, BatchedSerializer, 
UTF8Deserializer, \
-PairDeserializer, AutoBatchedSerializer, NoOpSerializer
-from pyspark.storagelevel import StorageLevel
-from pyspark.rdd import RDD, _load_from_socket, ignore_unicode_prefix
-from pyspark.traceback_utils import CallSite, first_spark_call
+from pyspark.profiler import BasicProfiler
+from pyspark.profiler import ProfilerCollector
+from pyspark.rdd import RDD
+from pyspark.rdd import _load_from_socket
+from pyspark.rdd import ignore_unicode_prefix
+from pyspark.serializers import AutoBatchedSerializer
--- End diff --

Yeah, we get that in the Scala code for sure, although you get conflicts 
even if you merely change adjacent lines anyway, so this only avoids so many 
conflicts. It's a decent open question; I wonder what others think? The 
downside is the extra duplication of import boilerplate.
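
Purely for illustration (not from the PR), the two Scala import styles being 
discussed; the grouped form is more compact, while the one-per-line form puts 
each added name on its own line, which tends to merge more cleanly:

```scala
// Grouped imports: compact, but adding a name edits an existing line.
import org.apache.spark.sql.catalyst.expressions.{Alias, Cast, Literal}

// One import per line: more boilerplate, but every new name is a new line,
// so concurrent additions on different branches rarely touch the same line.
import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.expressions.Literal
```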


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13146: [SPARK-13081][PYSPARK][SPARK_SUBMIT]. Allow set p...

2016-08-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13146#discussion_r74101878
  
--- Diff: 
core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java ---
@@ -89,6 +89,11 @@ public void testSparkArgumentHandling() throws Exception 
{
 launcher.setConf("spark.foo", "foo");
 launcher.addSparkArg(opts.CONF, "spark.foo=bar");
 assertEquals("bar", launcher.builder.conf.get("spark.foo"));
+
+launcher.setConf(SparkLauncher.PYSPARK_DRIVER_PYTHON, "python3.4");
+launcher.setConf(SparkLauncher.PYSPARK_PYTHON, "python3.5");
+assertEquals("python3.4", 
launcher.builder.conf.get("spark.pyspark.driver.python"));
--- End diff --

It would be better if you checked `PYSPARK_DRIVER_PYTHON.key()`, but I 
don't know how easy it is to call that from Java. (You could create a tiny 
Scala test class for that in `core/src/test/scala/org/apache/spark/launcher`.)
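
A minimal sketch of such a helper, assuming the new entries are `ConfigEntry`s 
in `org.apache.spark.internal.config`; the object name and location here are 
hypothetical, not part of the PR:

```scala
package org.apache.spark.launcher

import org.apache.spark.internal.config.{PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON}

// Hypothetical bridge so the Java-side SparkLauncherSuite can reference the
// real config keys instead of hard-coding "spark.pyspark.driver.python".
private[launcher] object PythonConfKeys {
  val pysparkDriverPython: String = PYSPARK_DRIVER_PYTHON.key
  val pysparkPython: String = PYSPARK_PYTHON.key
}
```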


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74102227
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -840,6 +844,7 @@ createExternalTable <- function(x, ...) {
 #'  clause expressions used to split the column 
`partitionColumn` evenly.
 #'  This defaults to SparkContext.defaultParallelism 
when unset.
 #' @param predicates a list of conditions in the where clause; each one 
defines one partition
+#' @param ... additional argument(s) passed to the method.
--- End diff --

in this case I'd reference the same wording as in L829, something like:
"additional JDBC database connection named properties"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13146: [SPARK-13081][PYSPARK][SPARK_SUBMIT]. Allow set p...

2016-08-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13146#discussion_r74102346
  
--- Diff: docs/configuration.md ---
@@ -427,6 +427,22 @@ Apart from these, the following properties are also 
available, and may be useful
 with spark.jars.packages.
   
 
+
+  spark.pyspark.driver.python
+  
+  
+Python binary executable to use for PySpark in driver.
+(default is spark.pyspark.python).
+  
+
+
+  spark.pyspark.python
+  
+  
+Python binary executable to use for PySpark in both driver and 
executors.
+.
--- End diff --

This tag is out of place.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13146: [SPARK-13081][PYSPARK][SPARK_SUBMIT]. Allow set p...

2016-08-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13146#discussion_r74102371
  
--- Diff: docs/configuration.md ---
@@ -427,6 +427,22 @@ Apart from these, the following properties are also 
available, and may be useful
 with spark.jars.packages.
   
 
+
+  spark.pyspark.driver.python
+  
+  
+Python binary executable to use for PySpark in driver.
+(default is spark.pyspark.python).
--- End diff --

stray ``


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14567: Python import reorg

2016-08-09 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14567#discussion_r74102329
  
--- Diff: python/pyspark/context.py ---
@@ -22,22 +22,30 @@
 import signal
 import sys
 import threading
-from threading import RLock
 from tempfile import NamedTemporaryFile
+from threading import RLock
 
 from pyspark import accumulators
 from pyspark.accumulators import Accumulator
 from pyspark.broadcast import Broadcast
 from pyspark.conf import SparkConf
 from pyspark.files import SparkFiles
 from pyspark.java_gateway import launch_gateway
-from pyspark.serializers import PickleSerializer, BatchedSerializer, 
UTF8Deserializer, \
-PairDeserializer, AutoBatchedSerializer, NoOpSerializer
-from pyspark.storagelevel import StorageLevel
-from pyspark.rdd import RDD, _load_from_socket, ignore_unicode_prefix
-from pyspark.traceback_utils import CallSite, first_spark_call
+from pyspark.profiler import BasicProfiler
+from pyspark.profiler import ProfilerCollector
+from pyspark.rdd import RDD
+from pyspark.rdd import _load_from_socket
+from pyspark.rdd import ignore_unicode_prefix
+from pyspark.serializers import AutoBatchedSerializer
--- End diff --

As you can see at https://pypi.python.org/pypi/isort, there are several ways 
to format these multi-import statements. At the very least, I recommend 
enforcing the sorting of these import lines so there is no ambiguity about 
where to place any "import" (and isort will correct the developer's change).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74102403
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -727,6 +729,7 @@ dropTempView <- function(viewName) {
 #' @param source The name of external data source
 #' @param schema The data schema defined in structType
 #' @param na.strings Default string value for NA when source is "csv"
+#' @param ... additional argument(s) passed to the method.
--- End diff --

something like:
"additional external data source specific named properties"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13146: [SPARK-13081][PYSPARK][SPARK_SUBMIT]. Allow set p...

2016-08-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13146#discussion_r74102459
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
@@ -37,8 +38,11 @@ object PythonRunner {
 val pythonFile = args(0)
 val pyFiles = args(1)
 val otherArgs = args.slice(2, args.length)
-val pythonExec =
-  sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", 
sys.env.getOrElse("PYSPARK_PYTHON", "python"))
+val sparkConf = new SparkConf()
+val pythonExec = sparkConf.get(PYSPARK_DRIVER_PYTHON)
--- End diff --

I find the version I suggested earlier (using `orElse` instead of nested 
`getOrElse` calls) more readable.
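
For comparison, a sketch of the `orElse`-based resolution, assuming the new 
entries are optional `ConfigEntry`s so that `SparkConf.get` returns an 
`Option[String]` here:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.internal.config.{PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON}

// Resolve the driver python by falling back with orElse instead of nesting
// getOrElse: driver conf, then generic conf, then the environment, then "python".
def pythonExec(sparkConf: SparkConf): String =
  sparkConf.get(PYSPARK_DRIVER_PYTHON)
    .orElse(sparkConf.get(PYSPARK_PYTHON))
    .orElse(sys.env.get("PYSPARK_DRIVER_PYTHON"))
    .orElse(sys.env.get("PYSPARK_PYTHON"))
    .getOrElse("python")
```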


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74103028
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -257,23 +257,24 @@ createDataFrame.default <- function(data, schema = 
NULL, samplingRatio = 1.0) {
 }
 
 createDataFrame <- function(x, ...) {
-  dispatchFunc("createDataFrame(data, schema = NULL, samplingRatio = 
1.0)", x, ...)
+  dispatchFunc("createDataFrame(data, schema = NULL)", x, ...)
 }
 
 #' @rdname createDataFrame
 #' @aliases createDataFrame
 #' @export
 #' @method as.DataFrame default
 #' @note as.DataFrame since 1.6.0
-as.DataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) 
{
--- End diff --

the `.default` methods are here for backward compatibility 
(please see SPARK-16693 / PR #14330).
I don't think we should change the signature - or it will break existing 
users. Perhaps just add the @param and say "Currently not used"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13146: [SPARK-13081][PYSPARK][SPARK_SUBMIT]. Allow set pythonEx...

2016-08-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/13146
  
A few minor things, otherwise LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74103250
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3184,6 +3185,7 @@ setMethod("histogram",
 #' @param x A SparkDataFrame
 #' @param url JDBC database url of the form `jdbc:subprotocol:subname`
 #' @param tableName The name of the table in the external database
+#' @param ... additional argument(s) passed to the method
--- End diff --

ditto, something like "additional JDBC database connection properties"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74104039
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3049,8 +3050,8 @@ setMethod("drop",
 #'
 #' @name histogram
 #' @param nbins the number of bins (optional). Default value is 10.
+#' @param col the column (described by character or Column object) to 
build the histogram from.
--- End diff --

we generally don't say "Column object" - the object system in R is a bit 
different, and I think "S4 class" would make more sense here?
but I'd suggest simplifying it to:
"@param col the column, as a character string or a Column, to build the 
histogram from"



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable credential...

2016-08-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r74104234
  
--- Diff: docs/running-on-yarn.md ---
@@ -525,16 +524,23 @@ token for the cluster's HDFS filesystem, and 
potentially for HBase and Hive.
 
 An HBase token will be obtained if HBase is in on classpath, the HBase 
configuration declares
 the application is secure (i.e. `hbase-site.xml` sets 
`hbase.security.authentication` to `kerberos`),
-and `spark.yarn.security.tokens.hbase.enabled` is not set to `false`.
+and `spark.yarn.security.credentials.hbase.enabled` is not set to `false`.
 
 Similarly, a Hive token will be obtained if Hive is on the classpath, its 
configuration
 includes a URI of the metadata store in `"hive.metastore.uris`, and
-`spark.yarn.security.tokens.hive.enabled` is not set to `false`.
+`spark.yarn.security.credentials.hive.enabled` is not set to `false`.
 
 If an application needs to interact with other secure HDFS clusters, then
 the tokens needed to access these clusters must be explicitly requested at
 launch time. This is done by listing them in the 
`spark.yarn.access.namenodes` property.
 
+Spark supports integrating with other security-aware services through Java 
Services mechanism (see
+`java.util.ServiceLoader`). To do that, implementations of 
`org.apache.spark.deploy.yarn.security.ServiceCredentialProvider`
+should be available to Spark by listing their names in the corresponding 
file in the jar's
+`META-INF/services` directory. These plug-ins can be disabled by setting
+`spark.yarn.security.tokens.{service}.enabled` to `false`, where 
`{service}` is the name of
+credential provider.
+
 ```
 spark.yarn.access.namenodes 
hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/
--- End diff --

Hmm, it seems like this example should be before the paragraph you're 
adding.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74104522
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2845,8 +2844,11 @@ setMethod("fillna",
 #' Since data.frames are held in memory, ensure that you have enough memory
 #' in your system to accommodate the contents.
 #'
-#' @param x a SparkDataFrame
-#' @return a data.frame
+#' @param x a SparkDataFrame.
+#' @param row.names NULL or a character vector giving the row names for 
the data frame.
+#' @param optional If `TRUE`, converting column names is optional.
+#' @param ... additional arguments passed to the method.
--- End diff --

in this case "additional arguments to pass to base::as.data.frame"



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74104886
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1146,7 +1147,7 @@ setMethod("head",
 
 #' Return the first row of a SparkDataFrame
 #'
-#' @param x A SparkDataFrame
--- End diff --

I think, similar to what you have for other functions, this could go to 
generic.R - do you have any other ideas on how to document functions that 
work with multiple classes?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...

2016-08-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14558#discussion_r74105069
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -177,11 +176,10 @@ setMethod("isLocal",
 #'
 #' Print the first numRows rows of a SparkDataFrame
 #'
-#' @param x A SparkDataFrame
-#' @param numRows The number of rows to print. Defaults to 20.
-#' @param truncate Whether truncate long strings. If true, strings more 
than 20 characters will be
-#' truncated and all cells will be aligned right
-#'
+#' @param numRows the number of rows to print. Defaults to 20.
+#' @param truncate whether truncate long strings. If true, strings more 
than 20 characters will be
--- End diff --

true -> TRUE


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...

2016-08-09 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14558
  
Thank you for working on this. I've done a pass and added my notes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14346: [SPARK-16710] [SparkR] [ML] spark.glm should support wei...

2016-08-09 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14346
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14065: [SPARK-14743][YARN] Add a configurable credential manage...

2016-08-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/14065
  
Looks fine. There are some possible enhancements (e.g. what looks like some 
code repetition in the HDFS provider, neither Hive nor HBase returns a token 
renewal time, etc.), but those can be done separately.

@tgravescs did you have any remaining comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14065: [SPARK-14743][YARN] Add a configurable credential manage...

2016-08-09 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/14065
  
all my original comments were addressed, and I won't have time to do another 
review until next week, so I'm good with it if you are.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14551: [SPARK-16961][CORE] Fixed off-by-one error that b...

2016-08-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14551#discussion_r74108468
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -874,4 +874,38 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
   }
 }
   }
+
+  test("chi square test of randomizeInPlace") {
+// Parameters
+val arraySize = 10
+val numTrials = 1000
+val threshold = 0.05
+val seed = 1L
+
+// results[i][j]: how many times Utils.randomize moves an element from 
position j to position i
+val results: Array[Array[Long]] = Array.ofDim(arraySize, arraySize)
+
+// This must be seeded because even a fair random process will fail 
this test with
+// probability equal to the value of `threshold`, which is 
inconvenient for a unit test.
+val rand = new java.util.Random(seed)
--- End diff --

Ah right, never mind me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY...

2016-08-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14544#discussion_r74109736
  
--- Diff: docs/spark-standalone.md ---
@@ -196,6 +196,21 @@ SPARK_MASTER_OPTS supports the following system 
properties:
   
 
 
+  spark.deploy.maxExecutorRetries
+  10
+  
+Limit on the maximum number of back-to-back executor failures that can 
occur before the
+standalone cluster manager removes a faulty application. An 
application will never be removed
+if it has any running executors. If an application experiences more 
than
+spark.deploy.maxExecutorRetries failures in a row, no 
executors
--- End diff --

Does "in a row" mean anything here?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY...

2016-08-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14544#discussion_r74109798
  
--- Diff: docs/spark-standalone.md ---
@@ -196,6 +196,21 @@ SPARK_MASTER_OPTS supports the following system 
properties:
   
 
 
+  spark.deploy.maxExecutorRetries
+  10
+  
+Limit on the maximum number of back-to-back executor failures that can 
occur before the
+standalone cluster manager removes a faulty application. An 
application will never be removed
+if it has any running executors. If an application experiences more 
than
+spark.deploy.maxExecutorRetries failures in a row, no 
executors
+successfully start running in between those failures, and the 
application has no running
+executors then the standalone cluster manager will remove the 
application and mark it as failed.
+To disable this automatic removal, set 
spark.deploy.maxExecutorRetries to
+-1
--- End diff --

add a period at the end.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY config...

2016-08-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14544
  
LGTM too.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/11157
  
@mgummelt pls review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY...

2016-08-09 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14544#discussion_r74110547
  
--- Diff: docs/spark-standalone.md ---
@@ -196,6 +196,21 @@ SPARK_MASTER_OPTS supports the following system 
properties:
   
 
 
+  spark.deploy.maxExecutorRetries
+  10
+  
+Limit on the maximum number of back-to-back executor failures that can 
occur before the
+standalone cluster manager removes a faulty application. An 
application will never be removed
+if it has any running executors. If an application experiences more 
than
+spark.deploy.maxExecutorRetries failures in a row, no 
executors
--- End diff --

Yes: if you have a sequence of executor events like FAIL RUNNING FAIL 
RUNNING ... then this resets the retry count, whereas FAIL FAIL FAIL FAIL... 
increments it.
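
To make the semantics concrete, an illustrative model of that behavior (not the 
actual standalone Master code): a successful start resets the consecutive-failure 
count, a failure increments it, and a negative limit disables removal:

```scala
sealed trait ExecutorEvent
case object Fail extends ExecutorEvent
case object Running extends ExecutorEvent

// True once more than maxRetries failures occur with no successful start in
// between; maxRetries = -1 (or any negative value) disables removal entirely.
def exceedsRetryLimit(events: Seq[ExecutorEvent], maxRetries: Int): Boolean =
  maxRetries >= 0 && events.foldLeft((0, false)) {
    case ((_, true), _)     => (0, true)      // limit already exceeded
    case (_, Running)       => (0, false)     // a running executor resets the count
    case ((fails, _), Fail) => (fails + 1, fails + 1 > maxRetries)
  }._2
```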


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14566: Make logDir easily copy/paste-able

2016-08-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14566
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11157
  
**[Test build #63445 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63445/consoleFull)**
 for PR 11157 at commit 
[`4ff5de4`](https://github.com/apache/spark/commit/4ff5de4500b841f405fccd9edaed0370f88dcd65).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14567: Python import reorg

2016-08-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14567
  
Can you create a JIRA ticket for this? This is too large to go in without a 
JIRA ticket.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14500#discussion_r74110917
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -425,6 +430,111 @@ case class AlterTableDropPartitionCommand(
 
 }
 
+/**
+ * Recover Partitions in ALTER TABLE: recover all the partition in the 
directory of a table and
+ * update the catalog.
+ *
+ * The syntax of this command is:
+ * {{{
+ *   ALTER TABLE table RECOVER PARTITIONS;
+ *   MSCK REPAIR TABLE table;
+ * }}}
+ */
+case class AlterTableRecoverPartitionsCommand(
+tableName: TableIdentifier,
+cmd: String = "ALTER TABLE RECOVER PARTITIONS") extends 
RunnableCommand {
+  override def run(spark: SparkSession): Seq[Row] = {
+val catalog = spark.sessionState.catalog
+if (!catalog.tableExists(tableName)) {
+  throw new AnalysisException(s"Table $tableName in $cmd does not 
exist.")
+}
+val table = catalog.getTableMetadata(tableName)
+if (catalog.isTemporaryTable(tableName)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on temporary tables: $tableName")
+}
+if (DDLUtils.isDatasourceTable(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd on datasource tables: $tableName")
+}
+if (table.tableType != CatalogTableType.EXTERNAL) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on external tables: 
$tableName")
+}
+if (!DDLUtils.isTablePartitioned(table)) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on partitioned tables: 
$tableName")
+}
+if (table.storage.locationUri.isEmpty) {
+  throw new AnalysisException(
+s"Operation not allowed: $cmd only works on table with location 
provided: $tableName")
+}
+
+val root = new Path(table.storage.locationUri.get)
+val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+// Dummy jobconf to get to the pathFilter defined in configuration
+// It's very expensive to create a 
JobConf(ClassUtil.findContainingJar() is slow)
+val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration, 
this.getClass)
+val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
+val partitionSpecsAndLocs = scanPartitions(
+  spark, fs, pathFilter, root, Map(), 
table.partitionColumnNames.map(_.toLowerCase))
+val parts = partitionSpecsAndLocs.map { case (spec, location) =>
+  // inherit table storage format (possibly except for location)
+  CatalogTablePartition(spec, table.storage.copy(locationUri = 
Some(location.toUri.toString)))
+}
+spark.sessionState.catalog.createPartitions(tableName,
+  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
+Seq.empty[Row]
+  }
+
+  @transient private lazy val evalTaskSupport = new 
ForkJoinTaskSupport(new ForkJoinPool(8))
+
+  private def scanPartitions(
+  spark: SparkSession,
+  fs: FileSystem,
+  filter: PathFilter,
+  path: Path,
+  spec: TablePartitionSpec,
+  partitionNames: Seq[String]): GenSeq[(TablePartitionSpec, Path)] = {
+if (partitionNames.length == 0) {
+  return Seq(spec -> path)
+}
+
+val statuses = fs.listStatus(path)
+val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", 
"10").toInt
+val statusPar: GenSeq[FileStatus] =
+  if (partitionNames.length > 1 && statuses.length > threshold || 
partitionNames.length > 2) {
+val parArray = statuses.par
+parArray.tasksupport = evalTaskSupport
+parArray
+  } else {
+statuses
+  }
+statusPar.flatMap { st =>
+  val name = st.getPath.getName
+  if (st.isDirectory && name.contains("=")) {
+val ps = name.split("=", 2)
+val columnName = 
PartitioningUtils.unescapePathName(ps(0)).toLowerCase
+// TODO: Validate the value
+val value = PartitioningUtils.unescapePathName(ps(1))
--- End diff --

yes, that makes sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14542: [SPARK-16930][yarn] Fix a couple of races in clus...

2016-08-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14542#discussion_r74111478
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -404,7 +410,8 @@ private[spark] class ApplicationMaster(
   clientMode = true)
 val driverRef = waitForSparkDriver()
 addAmIpFilter()
-registerAM(rpcEnv, driverRef, 
sparkConf.get("spark.driver.appUIAddress", ""), securityMgr)
+registerAM(sparkConf, rpcEnv, driverRef, 
sparkConf.get("spark.driver.appUIAddress", ""),
--- End diff --

Maybe, but that's an unrelated change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14468: [SPARK-16671][core][sql] Consolidate code to do variable...

2016-08-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/14468
  
Friendly ping.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14567: Python import reorg

2016-08-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14567
  
BTW this is actually a non-trivial change and would require a very careful 
look, since Python imports are not side-effect free.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY config...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14544
  
**[Test build #63446 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63446/consoleFull)**
 for PR 14544 at commit 
[`ef3297d`](https://github.com/apache/spark/commit/ef3297d374d34a3f5d55f83d2968e836b7e0b5d4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11157
  
**[Test build #63448 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63448/consoleFull)**
 for PR 11157 at commit 
[`2b192a1`](https://github.com/apache/spark/commit/2b192a19abcfc55bb5f8174bbbe6bdb60a695323).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14547
  
**[Test build #63447 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63447/consoleFull)**
 for PR 14547 at commit 
[`fe256f7`](https://github.com/apache/spark/commit/fe256f736a3a11625b6c3983a1a27ef9c5543280).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14384: [Spark-16443][SparkR] Alternating Least Squares (...

2016-08-09 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14384#discussion_r74112930
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +642,147 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+
+#' Alternating Least Squares (ALS) for Collaborative Filtering
+#'
+#' \code{spark.als} learns latent factors in collaborative filtering via 
alternating least
+#' squares. Users can call \code{summary} to obtain fitted latent factors, 
\code{predict}
+#' to make predictions on new data, and \code{write.ml}/\code{read.ml} to 
save/load fitted models.
+#'
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/ml-collaborative-filtering.html}{MLlib:
+#' Collaborative Filtering}.
+#' Additional arguments can be passed to the methods.
+#' \describe{
+#'\item{nonnegative}{logical value indicating whether to apply 
nonnegativity constraints.
+#'   Default: FALSE}
+#'\item{implicitPrefs}{logical value indicating whether to use 
implicit preference.
+#' Default: FALSE}
+#'\item{alpha}{alpha parameter in the implicit preference formulation 
(>= 0). Default: 1.0}
+#'\item{seed}{integer seed for random number generation. Default: 0}
+#'\item{numUserBlocks}{number of user blocks used to parallelize 
computation (> 0).
+#' Default: 10}
+#'\item{numItemBlocks}{number of item blocks used to parallelize 
computation (> 0).
+#' Default: 10}
+#'\item{checkpointInterval}{number of checkpoint intervals (>= 1) or 
disable checkpoint (-1).
+#'  Default: 10}
+#'}
+#'
+#' @param data A SparkDataFrame for training
+#' @param ratingCol column name for ratings
+#' @param userCol column name for user ids. Ids must be (or can be coerced 
into) integers
+#' @param itemCol column name for item ids. Ids must be (or can be coerced 
into) integers
+#' @param rank rank of the matrix factorization (> 0)
+#' @param reg regularization parameter (>= 0)
+#' @param maxIter maximum number of iterations (>= 0)
+
+#' @return \code{spark.als} returns a fitted ALS model
+#' @rdname spark.als
+#' @aliases spark.als,SparkDataFrame
+#' @name spark.als
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(ratings)
+#' model <- spark.als(df, "rating", "user", "item")
+#'
+#' # extract latent factors
+#' stats <- summary(model)
+#' userFactors <- stats$userFactors
+#' itemFactors <- stats$itemFactors
+#'
+#' # make predictions
+#' predicted <- predict(model, df)
+#' showDF(predicted)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#'
+#' # set other arguments
+#' modelS <- spark.als(df, "rating", "user", "item", rank = 20,
+#' reg = 0.1, nonnegative = TRUE)
+#' statsS <- summary(modelS)
+#' }
+#' @note spark.als since 2.1.0
+setMethod("spark.als", signature(data = "SparkDataFrame"),
+  function(data, ratingCol = "rating", userCol = "user", itemCol = 
"item",
+   rank = 10, reg = 1.0, maxIter = 10, ...) {
+
+`%||%` <- function(a, b) if (!is.null(a)) a else b
--- End diff --

In this case (since we set 7 default values) the code would be a little 
repetitive if we expanded every one of them. I guess this definition would not 
have side effects on the other functions in SparkR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14560: [SPARK-16971][SQL] Strip trailing zeros for decimal's st...

2016-08-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14560
  
hm this is not what postgres does

```
rxin=# select cast(2.01 as numeric(40, 20));
numeric 

 2.0100
(1 row)
```





[GitHub] spark issue #14560: [SPARK-16971][SQL] Strip trailing zeros for decimal's st...

2016-08-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14560
  
I'm going to mark this as won't fix. If it is a decimal type, I actually 
expect it to show me all the 0s.
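
To make the expectation concrete, a quick SparkR check (a sketch only, assuming an 
active session; the alias `d` is made up) would look like this, with the cast keeping 
the full declared scale just like the Postgres output quoted above:

```
library(SparkR)
sparkR.session()   # assumes a local Spark installation is available

# Cast a literal to a wide decimal and display it without truncation.
df <- sql("SELECT CAST(2.01 AS DECIMAL(40, 20)) AS d")
showDF(df, truncate = FALSE)
# Expected to print 2.01000000000000000000, i.e. the trailing zeros implied by
# the declared scale are preserved rather than stripped.
```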






[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11157
  
**[Test build #63449 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63449/consoleFull)**
 for PR 11157 at commit 
[`af9b19b`](https://github.com/apache/spark/commit/af9b19bfb055c4ec9d7b9449526c1efb9b089c3f).





[GitHub] spark issue #14566: Make logDir easily copy/paste-able

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14566
  
**[Test build #63444 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63444/consoleFull)**
 for PR 14566 at commit 
[`ed3292d`](https://github.com/apache/spark/commit/ed3292dd1a67f1d67d8e6a9ae1d02481f39fe4fd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14566: Make logDir easily copy/paste-able

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14566
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63444/
Test PASSed.





[GitHub] spark issue #14566: Make logDir easily copy/paste-able

2016-08-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14566
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY config...

2016-08-09 Thread JoshRosen
Github user JoshRosen commented on the issue:

https://github.com/apache/spark/pull/14544
  
Merging to master, branch-2.0, and branch-1.6.





[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11157
  
**[Test build #63450 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63450/consoleFull)**
 for PR 11157 at commit 
[`50a34b8`](https://github.com/apache/spark/commit/50a34b81248f6fc621d9a81ce8a5cc9d98b796d3).





[GitHub] spark pull request #14384: [Spark-16443][SparkR] Alternating Least Squares (...

2016-08-09 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14384#discussion_r74116056
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +642,147 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+
+#' Alternating Least Squares (ALS) for Collaborative Filtering
+#'
+#' \code{spark.als} learns latent factors in collaborative filtering via 
alternating least
+#' squares. Users can call \code{summary} to obtain fitted latent factors, 
\code{predict}
+#' to make predictions on new data, and \code{write.ml}/\code{read.ml} to 
save/load fitted models.
+#'
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/ml-collaborative-filtering.html}{MLlib:
+#' Collaborative Filtering}.
+#' Additional arguments can be passed to the methods.
+#' \describe{
+#'\item{nonnegative}{logical value indicating whether to apply 
nonnegativity constraints.
+#'   Default: FALSE}
+#'\item{implicitPrefs}{logical value indicating whether to use 
implicit preference.
+#' Default: FALSE}
+#'\item{alpha}{alpha parameter in the implicit preference formulation 
(>= 0). Default: 1.0}
+#'\item{seed}{integer seed for random number generation. Default: 0}
+#'\item{numUserBlocks}{number of user blocks used to parallelize 
computation (> 0).
+#' Default: 10}
+#'\item{numItemBlocks}{number of item blocks used to parallelize 
computation (> 0).
+#' Default: 10}
+#'\item{checkpointInterval}{number of checkpoint intervals (>= 1) or 
disable checkpoint (-1).
+#'  Default: 10}
+#'}
+#'
+#' @param data A SparkDataFrame for training
+#' @param ratingCol column name for ratings
+#' @param userCol column name for user ids. Ids must be (or can be coerced 
into) integers
+#' @param itemCol column name for item ids. Ids must be (or can be coerced 
into) integers
+#' @param rank rank of the matrix factorization (> 0)
+#' @param reg regularization parameter (>= 0)
+#' @param maxIter maximum number of iterations (>= 0)
+
+#' @return \code{spark.als} returns a fitted ALS model
+#' @rdname spark.als
+#' @aliases spark.als,SparkDataFrame
+#' @name spark.als
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(ratings)
+#' model <- spark.als(df, "rating", "user", "item")
+#'
+#' # extract latent factors
+#' stats <- summary(model)
+#' userFactors <- stats$userFactors
+#' itemFactors <- stats$itemFactors
+#'
+#' # make predictions
+#' predicted <- predict(model, df)
+#' showDF(predicted)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#'
+#' # set other arguments
+#' modelS <- spark.als(df, "rating", "user", "item", rank = 20,
+#' reg = 0.1, nonnegative = TRUE)
+#' statsS <- summary(modelS)
+#' }
+#' @note spark.als since 2.1.0
+setMethod("spark.als", signature(data = "SparkDataFrame"),
+  function(data, ratingCol = "rating", userCol = "user", itemCol = 
"item",
+   rank = 10, reg = 1.0, maxIter = 10, ...) {
+
+`%||%` <- function(a, b) if (!is.null(a)) a else b
+
+args <- list(...)
+numUserBlocks <- args$numUserBlocks %||% 10
+numItemBlocks <- args$numItemBlocks %||% 10
+implicitPrefs <- args$implicitPrefs %||% FALSE
+alpha <- args$alpha %||% 1.0
+nonnegative <- args$nonnegative %||% FALSE
+checkpointInterval <- args$checkpointInterval %||% 10
+seed <- args$seed %||% 0
+
+features <- array(c(ratingCol, userCol, itemCol))
+distParams <- array(as.integer(c(numUserBlocks, numItemBlocks,
+ checkpointInterval, seed)))
+
+jobj <- callJStatic("org.apache.spark.ml.r.ALSWrapper",
+"fit", data@sdf, features, 
as.integer(rank),
+reg, as.integer(maxIter), implicitPrefs, 
alpha, nonnegative,
--- End diff --

Done. Thanks!



[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...

2016-08-09 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/11157
  
Do we need to make any changes to the documentation?





[GitHub] spark pull request #14500: [SPARK-16905] SQL DDL: MSCK REPAIR TABLE

2016-08-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14500





[GitHub] spark pull request #14544: [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY...

2016-08-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14544





[GitHub] spark issue #14384: [Spark-16443][SparkR] Alternating Least Squares (ALS) wr...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14384
  
**[Test build #63451 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63451/consoleFull)**
 for PR 14384 at commit 
[`5535736`](https://github.com/apache/spark/commit/5535736b2588f4d488a4f762d7f5f237d56cede1).





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution in CTE by ...

2016-08-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14452
  
Will the deduplication logic for conflicting attributes in the Analyzer affect 
your solution? 


https://github.com/apache/spark/blob/06f5dc841517e7156f5f445655d97ba541ebbd7e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L478-L482
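
A hypothetical query that exercises the situation being asked about, for anyone who 
wants to inspect it: when the same CTE is referenced on both sides of a join, its 
output attributes appear twice and the Analyzer must end up giving the two sides 
distinct expression IDs. The SparkR calls below are illustrative only and assume an 
active session:

```
# Register a tiny source table, then join a CTE with itself.
df0 <- createDataFrame(data.frame(id = 1:10))
createOrReplaceTempView(df0, "src")

query <- "WITH t AS (SELECT id FROM src)
          SELECT a.id FROM t a JOIN t b ON a.id = b.id"
df <- sql(query)
explain(df, extended = TRUE)   # inspect how the two references to `t` were resolved
```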





[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14182
  
**[Test build #63452 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63452/consoleFull)**
 for PR 14182 at commit 
[`3742490`](https://github.com/apache/spark/commit/37424904318b55e13b63d14fcad7ac5fbd82a4c7).





[GitHub] spark issue #14175: [SPARK-16522][MESOS] Spark application throws exception ...

2016-08-09 Thread mgummelt
Github user mgummelt commented on the issue:

https://github.com/apache/spark/pull/14175
  
@sun-rui Let me know if you are unable to do so.  We need this in 2.0





[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...

2016-08-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14182
  
**[Test build #63452 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63452/consoleFull)**
 for PR 14182 at commit 
[`3742490`](https://github.com/apache/spark/commit/37424904318b55e13b63d14fcad7ac5fbd82a4c7).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.




