[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14100
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14100
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61945/
Test PASSed.





[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14100
  
**[Test build #61945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61945/consoleFull)** for PR 14100 at commit [`e00bc53`](https://github.com/apache/spark/commit/e00bc53cc9835104045a5c451a058c77d84f382c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/13701
  
@gatorsmile These are the benchmark results. No significant difference.

Before this patch:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
reading Parquet file                      855 / 1162          2.4         417.2       1.0X

After this patch:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
reading Parquet file                      874 / 1228          2.3         426.7       1.0X






[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-07 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r70018330
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+      'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+      'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
+      '1'""")
--- End diff --

Complication with the error
`Error:(686, 9) annotation argument needs to be a constant; found: 
scala.this.Predef.augmentString`
So I should probably remain the current extended description string.
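For context, a minimal self-contained illustration of that constraint (the JDK's `SuppressWarnings`, a Java annotation, stands in here for Spark's Java-defined `ExpressionDescription`; the class names are made up):

```scala
// scalac requires compile-time constants for Java annotation arguments.
@SuppressWarnings(Array("unchecked"))   // OK: a string literal is a constant
class CompilesFine

// The variant below does NOT compile: `.stripMargin` is a method call
// (Predef.augmentString(...).stripMargin), so the argument is no longer
// a constant expression.
//
// @SuppressWarnings(Array("""|unchecked""".stripMargin))
// class FailsToCompile
// // error: annotation argument needs to be a constant;
// //        found: scala.this.Predef.augmentString("|unchecked").stripMargin
```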





[GitHub] spark issue #13890: [SPARK-16189][SQL] Add ExternalRDD logical plan for inpu...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13890
  
**[Test build #61950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61950/consoleFull)** for PR 13890 at commit [`e218f5f`](https://github.com/apache/spark/commit/e218f5fbef996ca3e9606cc68bf433c83ebf224e).





[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14099
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61944/
Test FAILed.





[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14099
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14099
  
**[Test build #61944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61944/consoleFull)** for PR 14099 at commit [`9ce8146`](https://github.com/apache/spark/commit/9ce81468b41c7515c6ce3fd4791e0ecc03a7de19).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14071#discussion_r70017144
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
@@ -162,25 +147,28 @@ private[hive] case class MetastoreRelation(
   tPartition.setTableName(tableName)
   tPartition.setValues(partitionKeys.map(a => p.spec(a.name)).asJava)
 
-  val sd = new org.apache.hadoop.hive.metastore.api.StorageDescriptor()
+  val sd = toHiveStorage(p.storage, catalogTable.schema.map(toHiveColumn))
   tPartition.setSd(sd)
-  sd.setCols(catalogTable.schema.map(toHiveColumn).asJava)
-  p.storage.locationUri.foreach(sd.setLocation)
-  p.storage.inputFormat.foreach(sd.setInputFormat)
-  p.storage.outputFormat.foreach(sd.setOutputFormat)
+  new Partition(hiveQlTable, tPartition)
+}
+  }
 
-  val serdeInfo = new org.apache.hadoop.hive.metastore.api.SerDeInfo
-  sd.setSerdeInfo(serdeInfo)
-  // maps and lists should be set only after all elements are ready (see HIVE-7975)
-  p.storage.serde.foreach(serdeInfo.setSerializationLib)
+  private def toHiveStorage(storage: CatalogStorageFormat, schema: Seq[FieldSchema]) = {
--- End diff --

This function has a bug, right? `storage` is not being used.
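For illustration, a sketch of the likely fix, reusing the storage-handling lines the patch deleted (an assumed shape, not the PR's actual code):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.{FieldSchema, SerDeInfo, StorageDescriptor}
import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat

object HiveStorageSketch {
  // Every field now comes from the `storage` parameter rather than an
  // outer `p.storage`, so the parameter is actually used.
  def toHiveStorage(storage: CatalogStorageFormat, schema: Seq[FieldSchema]): StorageDescriptor = {
    val sd = new StorageDescriptor()
    sd.setCols(schema.asJava)
    storage.locationUri.foreach(sd.setLocation)
    storage.inputFormat.foreach(sd.setInputFormat)
    storage.outputFormat.foreach(sd.setOutputFormat)
    val serdeInfo = new SerDeInfo
    sd.setSerdeInfo(serdeInfo)
    // maps and lists should be set only after all elements are ready (see HIVE-7975)
    storage.serde.foreach(serdeInfo.setSerializationLib)
    sd
  }
}
```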





[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14028
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61943/
Test PASSed.





[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14028
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14028
  
**[Test build #61943 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61943/consoleFull)** for PR 14028 at commit [`6570a98`](https://github.com/apache/spark/commit/6570a9874e60ecb9366ea37a0e5dfe06b821dc62).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14102: [SPARK-16434][SQL][WIP] Avoid record-per type dispatch i...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14102
  
**[Test build #61949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61949/consoleFull)** for PR 14102 at commit [`2d77f66`](https://github.com/apache/spark/commit/2d77f66f2c78bb139212011bfa1fa2efbf6b9d5b).





[GitHub] spark issue #13991: [SPARK-16318][SQL] Implement various xpath functions

2016-07-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13991
  
Also - rather than having concrete implementations for all of these, why 
don't we use RuntimeReplaceable?
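For readers unfamiliar with the suggestion: a `RuntimeReplaceable` expression is rewritten during analysis into expressions that already have evaluators, so no new evaluation or codegen code is needed. A toy sketch of the pattern (hypothetical names, not Spark's real API):

```scala
// Toy expression tree, for illustration only.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class If(pred: Expr, onTrue: Expr, onFalse: Expr) extends Expr

// A "runtime replaceable" expression only knows how to rewrite itself
// into expressions that already have implementations.
trait RuntimeReplaceableLike extends Expr { def replacement: Expr }

// nullif(a, b) = if (a = b) null else a -- needs no evaluator of its own.
case class NullIf(a: Expr, b: Expr) extends RuntimeReplaceableLike {
  def replacement: Expr = If(EqualTo(a, b), Lit(null), a)
}

// The analyzer would substitute `replacement` for every such node
// before execution.
```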






[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14082
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61946/
Test PASSed.





[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14082
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14082
  
**[Test build #61946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61946/consoleFull)** for PR 14082 at commit [`1af09f3`](https://github.com/apache/spark/commit/1af09f31fd506143e8fe45b530dd46e39df76d6b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

2016-07-07 Thread husseinhazimeh
Github user husseinhazimeh commented on the issue:

https://github.com/apache/spark/pull/14101
  
@mengxr @sethah can you review this patch?





[GitHub] spark issue #14102: [SPARK-16434][SQL][WIP] Avoid record-per type dispatch i...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14102
  
**[Test build #61948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61948/consoleFull)** for PR 14102 at commit [`74fa944`](https://github.com/apache/spark/commit/74fa944209491b9884dbfc8b71e56e36b45e28a4).





[GitHub] spark pull request #14102: [SPARK-16434][SQL][WIP] Avoid record-per type dis...

2016-07-07 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/14102

[SPARK-16434][SQL][WIP] Avoid record-per type dispatch in JSON when reading

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)


## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-16434

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14102.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14102


commit 74fa944209491b9884dbfc8b71e56e36b45e28a4
Author: hyukjinkwon 
Date:   2016-07-08T01:14:18Z

Avoid record-per type dispatch in JSON when reading







[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14101
  
Can one of the admins verify this patch?





[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...

2016-07-07 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/14065
  
Thanks a lot @tgravescs and @vanzin for your suggestions. I will change the code accordingly; I greatly appreciate your comments.





[GitHub] spark pull request #14101: [SPARK-16431] [ML] Add a unified method that acce...

2016-07-07 Thread husseinhazimeh
GitHub user husseinhazimeh opened a pull request:

https://github.com/apache/spark/pull/14101

[SPARK-16431] [ML] Add a unified method that accepts single instances to 
feature transformers and predictors

## What changes were proposed in this pull request?
Current feature transformers in spark.ml can only operate on DataFrames and 
don't have a method that accepts single instances. A typical transformer has a 
User-Defined Function (udf) in its `transform` method which includes a set of 
operations on the features of a single instance:

```
val column_operation = udf {operations on single instance}
```

Adding a new method called `transformInstance` that operates directly on 
single instances and using it in the udf instead can be useful:

```
def transformInstance(features: featuresType): OutputType = {operations on single instance}

val column_operation = udf {transformInstance}
```

Predictors also don't have a public method that does predictions on single 
instances. `transformInstance` can be easily added to predictors by acting as a 
wrapper for the internal method predict (which takes features as input).
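A minimal sketch of that wrapper idea (`FeaturesType` and the `Double` return type are stand-ins, not the PR's exact code):

```scala
// Hypothetical predictor shape, for illustration only.
abstract class SimplePredictor[FeaturesType] {
  // spark.ml predictors already have an internal per-instance predict().
  protected def predict(features: FeaturesType): Double

  // Proposed public single-instance entry point: a thin wrapper over
  // predict() that skips all DataFrame machinery.
  def transformInstance(features: FeaturesType): Double = predict(features)
}
```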

Note: The proposed method in this change is added to all predictors and 
feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which 
might require bigger changes due to dependencies on the dataset's schema (they 
can be fixed using simple hacks but this needs to be discussed)

## Benefits

1. Providing a low-latency transformation/prediction method to support 
machine learning applications that require real-time predictions. The current 
`transform` method has a relatively high latency when transforming single 
instances or small batches due to the overhead introduced by DataFrame 
operations. I measured the latency required to classify a single instance in 
the 20 Newsgroups dataset using the current `transform` method and the proposed 
`transformInstance`.  The ML pipeline contains a tokenizer, stopword remover, 
TF hasher, IDF, scaler, and Logistic Regression. The table below shows the 
latency percentiles in milliseconds after measuring the time to classify 700 
documents.

 Transformation Method | P50 | P90 | P99 | Max
 --- | --- | --- | --- | ---
 transform | 31.44 | 39.43 | 67.75 | 126.97
 transformInstance | 0.16 | 0.38 | 1.16 | 3.2

 `transformInstance` is 200 times faster on average and can classify a 
document in less than a millisecond.  By profiling the code of `transform`, it 
turns out that every transformer in the pipeline wastes 5 milliseconds on 
average in DataFrame-related operations when transforming a single instance. 
This implies that the latency increases linearly with the pipeline size which 
can be problematic.
 
2. Increasing code readability and allowing easier debugging as operations 
on rows are now combined into a function that can be tested independently of 
the higher-level `transform` method.

3. Adding flexibility to create new models: for example, check this 
[comment](https://github.com/apache/spark/pull/8883#issuecomment-215559305) on 
supporting new ensemble methods.

## How was this patch tested?
The current tests for transformers and predictors, which invoke 
`transformInstance` internally, passed. 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/husseinhazimeh/spark lowlatency

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14101


commit e8b3de1e599225fa71fecc17aaa34998863fb38b
Author: Hussein Hazimeh 
Date:   2016-07-07T20:50:22Z

Add transformInstance method to predictors and transformers

commit ca213e338bde7da2e308b2ffd9c3fa1b5d26122e
Author: Hussein Hazimeh 
Date:   2016-07-07T21:03:46Z

Update LogisticRegression.scala

commit 1fe5b18a0519d324ed53108ddd809a421a811f50
Author: Hussein Hazimeh 
Date:   2016-07-07T21:21:45Z

Update HashingTF.scala







[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14071#discussion_r70012893
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -403,17 +400,18 @@ object CreateDataSourceTableUtils extends Logging {
   assert(partitionColumns.isEmpty)
   assert(relation.partitionSchema.isEmpty)
 
+  var storage = CatalogStorageFormat(
+    locationUri = None,
--- End diff --

oh it's a mistake...





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
Thank you!





[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14094
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14094
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61936/
Test PASSed.





[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14094
  
**[Test build #61936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61936/consoleFull)** for PR 14094 at commit [`9663b42`](https://github.com/apache/spark/commit/9663b429c5da0e7967530cea82fb372221d9741f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70012365
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
+# $example off:create_DataFrames$
+
+
+# $example on:untyped_transformations$
+# Create the DataFrame
+df <- read.json("examples/src/main/resources/people.json")
+
+# Show the content of the DataFrame
+showDF(df)
+## age  name
+## null Michael
+## 30   Andy
+## 19   Justin
+
+# Print the schema in a tree format
+printSchema(df)
+## root
+## |-- age: long (nullable = true)
+## |-- name: string (nullable = true)
+
+# Select only the "name" column
+showDF(select(df, "name"))
+## name
+## Michael
+## Andy
+## Justin
+
+# Select everybody, but increment the age by 1
+showDF(select(df, df$name, df$age + 1))
+## name(age + 1)
+## Michael null
+## Andy31
+## Justin  20
+
+# Select people older than 21
+showDF(where(df, df$age > 21))
+## age name
+## 30  Andy
+
+# Count people by age
+showDF(count(groupBy(df, "age")))
+## age  count
+## null 1
+## 19   1
+## 30   1
+# $example off:untyped_transformations$
+
+
+# $example on:sql_query$
+df <- sql("SELECT * FROM table")
--- End diff --

Let's register `df` from above using `createExternalTable` and then run the query. We should aim for this R file being executable on its own.





[GitHub] spark issue #13991: [SPARK-16318][SQL] Implement various xpath functions

2016-07-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13991
  
For this one I think we should consider supporting only foldable literals 
for the path component.
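A sketch of what such a restriction could look like, with toy types rather than Spark's actual `checkInputDataTypes`; requiring a foldable path means the XPath expression can be compiled once and reused for every row:

```scala
object XPathArgCheck {
  // Toy expression model, for illustration only.
  sealed trait Expr { def foldable: Boolean }
  case class Lit(value: String) extends Expr { val foldable = true }
  case class Col(name: String) extends Expr { val foldable = false }

  // Accept only a constant (foldable) path argument.
  def checkXPathArgs(xml: Expr, path: Expr): Either[String, Unit] =
    if (path.foldable) Right(())
    else Left("path should be a constant (foldable) string literal")
}

// checkXPathArgs(Col("xml"), Lit("a/b/text()"))  // Right(()) -- accepted
// checkXPathArgs(Col("xml"), Col("path"))        // Left(...) -- rejected
```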






[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14100
  
**[Test build #61945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61945/consoleFull)** for PR 14100 at commit [`e00bc53`](https://github.com/apache/spark/commit/e00bc53cc9835104045a5c451a058c77d84f382c).





[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14082
  
**[Test build #61946 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61946/consoleFull)** for PR 14082 at commit [`1af09f3`](https://github.com/apache/spark/commit/1af09f31fd506143e8fe45b530dd46e39df76d6b).





[GitHub] spark issue #13991: [SPARK-16318][SQL] Implement various xpath functions

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13991
  
**[Test build #61947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61947/consoleFull)** for PR 13991 at commit [`2d48ae5`](https://github.com/apache/spark/commit/2d48ae553736928cb4003a530608513d6f3307ab).





[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...

2016-07-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14096





[GitHub] spark pull request #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain ...

2016-07-07 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/14100

[SPARK-16433][SQL]Improve StreamingQuery.explain when no data arrives

## What changes were proposed in this pull request?

Display `No physical plan. Waiting for data.` instead of `N/A`  for 
StreamingQuery.explain when no data arrives because `N/A` doesn't provide 
meaningful information.
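The gist of the change, as a sketch with assumed names (not the PR's exact code):

```scala
object ExplainSketch {
  // Fall back to an informative message when no batch has run yet and
  // there is therefore no physical plan to render.
  def explainString(lastExecutionPlan: Option[String]): String =
    lastExecutionPlan.getOrElse("No physical plan. Waiting for data.")
}

// ExplainSketch.explainString(None)                   // "No physical plan. Waiting for data."
// ExplainSketch.explainString(Some("*Scan json ...")) // the actual plan text
```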

## How was this patch tested?

Existing unit tests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-16433

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14100


commit e00bc53cc9835104045a5c451a058c77d84f382c
Author: Shixiong Zhu 
Date:   2016-07-08T00:45:24Z

Improve StreamingQuery.explain when no data arrives







[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14096
  
LGTM. Thanks @dongjoon-hyun -- Merging this to master, branch-2.0





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
Hi, @shivaram .
Now, it's ready for review again.
Please let me know if there is anything more to do.
Thank you!





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61942/
Test PASSed.





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70011796
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
+# $example off:create_DataFrames$
+
+
+# $example on:untyped_transformations$
+# Create the DataFrame
+df <- read.json("examples/src/main/resources/people.json")
+
+# Show the content of the DataFrame
+showDF(df)
+## age  name
+## null Michael
+## 30   Andy
+## 19   Justin
+
+# Print the schema in a tree format
+printSchema(df)
+## root
+## |-- age: long (nullable = true)
+## |-- name: string (nullable = true)
+
+# Select only the "name" column
+showDF(select(df, "name"))
+## name
+## Michael
+## Andy
+## Justin
+
+# Select everybody, but increment the age by 1
+showDF(select(df, df$name, df$age + 1))
+## name(age + 1)
+## Michael null
+## Andy31
+## Justin  20
+
+# Select people older than 21
+showDF(where(df, df$age > 21))
+## age name
+## 30  Andy
+
+# Count people by age
+showDF(count(groupBy(df, "age")))
+## age  count
+## null 1
+## 19   1
+## 30   1
+# $example off:untyped_transformations$
+
+
+# $example on:sql_query$
+df <- sql("SELECT * FROM table")
--- End diff --

Should I add more code here to create the `table`, or just leave it since it's only for demonstration purposes?





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61942 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61942/consoleFull)** for PR 14096 at commit [`c332c52`](https://github.com/apache/spark/commit/c332c52ba7fb9e23a372a74cd2ac6ea8b3704b5d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13969: [SPARK-16284][SQL] Implement reflect SQL function

2016-07-07 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/13969
  
Ping!






[GitHub] spark issue #13374: [SPARK-13638][SQL] Add escapeAll option to CSV DataFrame...

2016-07-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13374
  
@jurriaan should this be called quoteAll rather than escapeAll?






[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13680
  
LGTM except some minor comments, it's pretty close! One easy-to-ignore 
comment: https://github.com/apache/spark/pull/13680/files#r69849567





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70011133
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

@shivaram your idea is better, vote+1





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70011061
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

Sorry, just ignore the one above.

I re-built with `build/mvn -DskipTests -Psparkr package` and everything works...





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r70010656
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter}
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData
+ * To run this:
+ *  1. replace ignore(...) with test(...)
+ *  2. build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark"
+ *
+ * Benchmarks in this file are skipped in normal builds.
+ */
+class UnsafeArrayDataBenchmark extends BenchmarkBase {
+
+  def calculateHeaderPortionInBytes(count: Int) : Int = {
+/* 4 + 4 * count // Use this expression for SPARK-15962 */
+UnsafeArrayData.calculateHeaderPortionInBytes(count)
+  }
+
+  def readUnsafeArray(iters: Int): Unit = {
+val count = 1024 * 1024 * 16
+val rand = new Random(42)
+
+val intPrimitiveArray = Array.fill[Int](count) { rand.nextInt }
+val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind()
+val intUnsafeArray = intEncoder.toRow(intPrimitiveArray).getArray(0)
+val readIntArray = { i: Int =>
+  var n = 0
+  while (n < iters) {
+val len = intUnsafeArray.numElements
+var sum = 0
+var i = 0
+while (i < len) {
+  sum += intUnsafeArray.getInt(i)
+  i += 1
+}
+n += 1
+  }
+}
+
+val doublePrimitiveArray = Array.fill[Double](count) { rand.nextDouble }
+val doubleEncoder = ExpressionEncoder[Array[Double]].resolveAndBind()
+val doubleUnsafeArray = doubleEncoder.toRow(doublePrimitiveArray).getArray(0)
+val readDoubleArray = { i: Int =>
+  var n = 0
+  while (n < iters) {
+val len = doubleUnsafeArray.numElements
+var sum = 0.0
+var i = 0
+while (i < len) {
+  sum += doubleUnsafeArray.getDouble(i)
+  i += 1
+}
+n += 1
+  }
+}
+
+val benchmark = new Benchmark("Read UnsafeArrayData", count * iters)
+benchmark.addCase("Int")(readIntArray)
+benchmark.addCase("Double")(readDoubleArray)
+benchmark.run
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4
+Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
+
+Read UnsafeArrayData:            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+----------------------------------------------------------------------------------------
+Int                                    279 /  294        600.4           1.7       1.0X
+Double                                 296 /  303        567.0           1.8       0.9X
+*/
+  }
+
+  def writeUnsafeArray(iters: Int): Unit = {
+val count = 1024 * 1024 * 16
+val rand = new Random(42)
+
+var intTotalLength: Int = 0
+val intPrimitiveArray = Array.fill[Int](count) { rand.nextInt }
+val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind()
+val writeIntArray = { i: Int =>
+  var n = 0
+  while (n < iters) {
+intTotalLength += intEncoder.toRow(intPrimitiveArray).getArray(0).numElements()
+n += 1
+  }
+}
+
+var doubleTotalLength: Int = 0
+val doublePrimitiveArray = Array.fill[Double](count) { rand.nextDouble }
+val doubleEncoder = 

[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r70010528
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -341,63 +327,115 @@ public UnsafeArrayData copy() {
 return arrayCopy;
   }
 
-  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
-if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
-  throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
-"it's too big.");
-}
+  @Override
+  public boolean[] toBooleanArray() {
+int size = numElements();
+boolean[] values = new boolean[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.BOOLEAN_ARRAY_OFFSET, size);
+return values;
+  }
 
-final int offsetRegionSize = 4 * arr.length;
-final int valueRegionSize = 4 * arr.length;
-final int totalSize = 4 + offsetRegionSize + valueRegionSize;
-final byte[] data = new byte[totalSize];
+  @Override
+  public byte[] toByteArray() {
+int size = numElements();
+byte[] values = new byte[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+return values;
+  }
 
-Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
+  @Override
+  public short[] toShortArray() {
+int size = numElements();
+short[] values = new short[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2);
+return values;
+  }
 
-int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4;
-int valueOffset = 4 + offsetRegionSize;
-for (int i = 0; i < arr.length; i++) {
-  Platform.putInt(data, offsetPosition, valueOffset);
-  offsetPosition += 4;
-  valueOffset += 4;
-}
+  @Override
+  public int[] toIntArray() {
+int size = numElements();
+int[] values = new int[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4);
+return values;
+  }
 
-Platform.copyMemory(arr, Platform.INT_ARRAY_OFFSET, data,
-  Platform.BYTE_ARRAY_OFFSET + 4 + offsetRegionSize, valueRegionSize);
+  @Override
+  public long[] toLongArray() {
+int size = numElements();
+long[] values = new long[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8);
+return values;
+  }
 
-UnsafeArrayData result = new UnsafeArrayData();
-result.pointTo(data, Platform.BYTE_ARRAY_OFFSET, totalSize);
-return result;
+  @Override
+  public float[] toFloatArray() {
+int size = numElements();
+float[] values = new float[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4);
+return values;
   }
 
-  public static UnsafeArrayData fromPrimitiveArray(double[] arr) {
-if (arr.length > (Integer.MAX_VALUE - 4) / 12) {
+  @Override
+  public double[] toDoubleArray() {
+int size = numElements();
+double[] values = new double[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8);
+return values;
+  }
+
+  private static UnsafeArrayData fromPrimitiveArray(
+   Object arr, int offset, int length, int elementSize) {
+final int headerInBytes = calculateHeaderPortionInBytes(length);
+final int valueRegionInBytes = elementSize * length;
+final int totalSizeInLongs = (headerInBytes + valueRegionInBytes + 7) / 8;
--- End diff --

Sorry, I was wrong :( We need to declare them all as `long` in order to do the overflow check.
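
For readers following along, here is a minimal Scala sketch (not the PR's code; the names mirror the diff) of the overflow being guarded against: with `Int` arithmetic, `elementSize * length` can wrap around before any bounds check runs, so the intermediate sizes must be widened to `Long` first.

```scala
// Illustrative only: why the intermediate sizes must be declared as long.
def totalSizeInLongs(headerInBytes: Int, elementSize: Int, length: Int): Long = {
  // Buggy Int version (sketch): 8 * 300000000 wraps to -1894967296, so a
  // later "totalSize > Integer.MAX_VALUE" style check passes silently.
  // val valueRegionInBytes = elementSize * length

  // Widening to Long before multiplying/adding makes overflow impossible
  // for any length that fits in an Int.
  val valueRegionInBytes = elementSize.toLong * length
  (headerInBytes.toLong + valueRegionInBytes + 7) / 8
}

assert(totalSizeInLongs(headerInBytes = 8, elementSize = 8, length = 300000000) > 0)
```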





[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14099
  
**[Test build #61944 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61944/consoleFull)**
 for PR 14099 at commit 
[`9ce8146`](https://github.com/apache/spark/commit/9ce81468b41c7515c6ce3fd4791e0ecc03a7de19).





[GitHub] spark pull request #14099: [SPARK-16432] Empty blocks fail to serialize due ...

2016-07-07 Thread ericl
GitHub user ericl opened a pull request:

https://github.com/apache/spark/pull/14099

[SPARK-16432] Empty blocks fail to serialize due to assert in 
ChunkedByteBuffer

## What changes were proposed in this pull request?

It's possible to also change the callers to not pass in empty chunks, but 
it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc 
@JoshRosen 

## How was this patch tested?

Unit tests, also checked that the original reproduction case in 
https://github.com/apache/spark/pull/11748#issuecomment-230760283 is resolved.
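
For context, a hedged sketch of the failure mode being fixed, assuming (from the title) that the assert rejected zero-length chunks; the class below is a simplified stand-in invented for illustration, not Spark's actual `ChunkedByteBuffer`:

```scala
import java.nio.ByteBuffer

// Simplified stand-in for a chunked buffer. Before the fix (sketch), a
// constructor-time assert along the lines of
//   require(chunks.forall(_.limit() > 0), "chunks must be non-empty")
// made any empty block fail to serialize. Dropping it lets empty chunks
// through; they simply contribute zero bytes.
class SimpleChunkedByteBuffer(val chunks: Array[ByteBuffer]) {
  require(chunks != null, "chunks must not be null")
  def size: Long = chunks.map(_.limit().toLong).sum
}

// An empty block now round-trips instead of tripping the assertion.
val empty = new SimpleChunkedByteBuffer(Array(ByteBuffer.allocate(0)))
assert(empty.size == 0L)
```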

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ericl/spark spark-16432

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14099.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14099


commit 9ce81468b41c7515c6ce3fd4791e0ecc03a7de19
Author: Eric Liang 
Date:   2016-07-08T00:18:29Z

Thu Jul  7 17:18:29 PDT 2016







[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14028
  
**[Test build #61943 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61943/consoleFull)**
 for PR 14028 at commit 
[`6570a98`](https://github.com/apache/spark/commit/6570a9874e60ecb9366ea37a0e5dfe06b821dc62).





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61942 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61942/consoleFull)**
 for PR 14096 at commit 
[`c332c52`](https://github.com/apache/spark/commit/c332c52ba7fb9e23a372a74cd2ac6ea8b3704b5d).





[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14028
  
Thanks @yhuai! I just addressed your comments.





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70009456
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

Do the unit tests pass? We have a unit test for `showDF`.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14098
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61941/
Test FAILed.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14098
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61940/
Test PASSed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61941 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61941/consoleFull)**
 for PR 14096 at commit 
[`08672d9`](https://github.com/apache/spark/commit/08672d98c682e1bbe78399c4b0814bfa28d45826).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14098
  
**[Test build #61940 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61940/consoleFull)**
 for PR 14098 at commit 
[`d92d933`](https://github.com/apache/spark/commit/d92d933937571b65670a2a308c936c0d9061b382).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70008864
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

I just ran `showDF()` and it seems this method is not working, while `head()` works fine.

Is it a problem with how I built SparkR, using `build/mvn -DskipTests -Psparkr -Phive package`?

```
> df <- read.json("examples/src/main/resources/people.json")

> showDF(df)
16/07/07 17:02:54 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.Dataset.showString. Candidates are:
16/07/07 17:02:54 WARN RBackendHandler: showString(int,int)
16/07/07 17:02:54 ERROR RBackendHandler: showString on 7 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :

> showDF(select(df, "name"))
16/07/07 17:03:38 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.Dataset.showString. Candidates are:
16/07/07 17:03:38 WARN RBackendHandler: showString(int,int)
16/07/07 17:03:38 ERROR RBackendHandler: showString on 18 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
```





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61941 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61941/consoleFull)**
 for PR 14096 at commit 
[`08672d9`](https://github.com/apache/spark/commit/08672d98c682e1bbe78399c4b0814bfa28d45826).





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70008649
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes.  It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, eg.: bad user code, which may lead to many
+ * task failures, but that should not count against individual executors; many small stages, which
+ * may prevent a bad executor for having many failures within one stage, but still many failures
+ * over the entire application; "flaky" executors, that don't fail every task, but are still
+ * faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe.  Though it is
+  * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl.  The
+  * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+conf: SparkConf,
+clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
--- End diff --

I generally ask people to add new configs to 
`core/src/main/scala/org/apache/spark/internal/config/package.scala`, but no 
big deal either way.
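
For concreteness, a sketch of what such an entry could look like, following the `ConfigBuilder` pattern already used in that file (the key name comes from the diff; the doc string is invented for illustration):

```scala
// Hypothetical entry for org/apache/spark/internal/config/package.scala;
// not code from this PR.
private[spark] val MAX_TASK_FAILURES_PER_NODE =
  ConfigBuilder("spark.blacklist.maxTaskFailuresPerNode")
    .doc("Number of task failures tolerated on a single node before the node is blacklisted.")
    .intConf
    .createWithDefault(2)
```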





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70008186
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala ---
@@ -281,6 +300,174 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
 assert(!failedTaskSet)
   }
 
+  test("scheduled tasks obey task and stage blacklists") {
+val blacklist = mock[BlacklistTracker]
+taskScheduler = setupScheduler(blacklist)
+val stage0 = FakeTask.createTaskSet(numTasks = 2, stageId = 0, stageAttemptId = 0)
+val stage1 = FakeTask.createTaskSet(numTasks = 2, stageId = 1, stageAttemptId = 0)
+val stage2 = FakeTask.createTaskSet(numTasks = 2, stageId = 2, stageAttemptId = 0)
+taskScheduler.submitTasks(stage0)
+taskScheduler.submitTasks(stage1)
+taskScheduler.submitTasks(stage2)
+
+val offers = Seq(
+  new WorkerOffer("executor0", "host0", 1),
+  new WorkerOffer("executor1", "host1", 1),
+  new WorkerOffer("executor2", "host1", 1),
+  new WorkerOffer("executor3", "host2", 10)
+)
+
+// setup our mock blacklist:
+// stage 0 is blacklisted on node "host1"
+// stage 1 is blacklist on executor "executor3"
--- End diff --

nit: blacklisted





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61938/
Test PASSed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70008001
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

I'd use `head` as the default for most examples; it feels most natural. We can then add one line to the programming guide along the lines of "You can also use `showDF` to print the first few rows and optionally truncate the printing of long values".





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61938 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61938/consoleFull)**
 for PR 14096 at commit 
[`65f2236`](https://github.com/apache/spark/commit/65f2236970d42ab1ee8115a5f0102119504e633f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70007837
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -682,11 +682,12 @@ private[spark] class ApplicationMaster(
 }
 
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
-  case RequestExecutors(requestedTotal, localityAwareTasks, hostToLocalTaskCount) =>
+  case RequestExecutors(
--- End diff --

Wonder if the argument list is really helping here. e.g.:

```
case r: RequestExecutors =>
  a.requestTotalExecutorsWithPreferredLocalities(r.requestedTotal, ...)
```
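
Spelled out as a self-contained sketch (all types and the handler body are simplified stand-ins for the real YARN classes), binding the whole message keeps the `case` line short even when every field is forwarded:

```scala
// Invented, minimal stand-ins for the sketch.
case class RequestExecutors(
    requestedTotal: Int,
    localityAwareTasks: Int,
    hostToLocalTaskCount: Map[String, Int])

def requestTotalExecutorsWithPreferredLocalities(
    total: Int, localityAware: Int, hostToTasks: Map[String, Int]): Unit = {
  // allocator call elided in this sketch
}

def receive(msg: Any): Unit = msg match {
  // A typed binder instead of a multi-line destructuring pattern.
  case r: RequestExecutors =>
    requestTotalExecutorsWithPreferredLocalities(
      r.requestedTotal, r.localityAwareTasks, r.hostToLocalTaskCount)
  case _ => // other messages ignored here
}
```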






[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...

2016-07-07 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/14082#discussion_r70007444
  
--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

Do you mean all `showDF()` calls should be replaced by `head()`? E.g. change `showDF(select(df, "name"))` to `head(select(df, "name"))` too?

Or should we leave both `showDF()` and `head()` as examples for the reader?





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70007341
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes.  It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, eg.: bad user code, which may lead to many
+ * task failures, but that should not count against individual executors; many small stages, which
+ * may prevent a bad executor for having many failures within one stage, but still many failures
+ * over the entire application; "flaky" executors, that don't fail every task, but are still
+ * faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe.  Though it is
+  * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl.  The
+  * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+conf: SparkConf,
+clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor.  Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage.  Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+val now = clock.getTimeMillis()
+// quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+if (now > nextExpiryTime) {
+  val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+  if (execsToClear.nonEmpty) {
+logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+  }

[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14098
  
**[Test build #61940 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61940/consoleFull)**
 for PR 14098 at commit 
[`d92d933`](https://github.com/apache/spark/commit/d92d933937571b65670a2a308c936c0d9061b382).





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70007113
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes.  It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, eg.: bad user code, which may lead to many
+ * task failures, but that should not count against individual executors; many small stages, which
+ * may prevent a bad executor for having many failures within one stage, but still many failures
+ * over the entire application; "flaky" executors, that don't fail every task, but are still
+ * faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe.  Though it is
+  * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl.  The
+  * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+conf: SparkConf,
+clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor.  Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage.  Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+val now = clock.getTimeMillis()
+// quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+if (now > nextExpiryTime) {
+  val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+  if (execsToClear.nonEmpty) {
+logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+  }

[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70006954
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes.  It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, eg.: bad user code, which may lead to many
+ * task failures, but that should not count against individual executors; many small stages, which
+ * may prevent a bad executor for having many failures within one stage, but still many failures
+ * over the entire application; "flaky" executors, that don't fail every task, but are still
+ * faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe.  Though it is
+  * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl.  The
+  * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+conf: SparkConf,
+clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor.  Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage.  Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+val now = clock.getTimeMillis()
+// quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+if (now > nextExpiryTime) {
+  val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+  if (execsToClear.nonEmpty) {
+logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+  }

[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14096#discussion_r70006844
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1804,11 +1804,11 @@ test_that("describe() and summarize() on a DataFrame", {
   expect_equal(collect(stats)[2, "age"], "24.5")
   expect_equal(collect(stats)[3, "age"], "7.7781745930520225")
   stats <- describe(df)
-  expect_equal(collect(stats)[4, "name"], "Andy")
+  expect_equal(columns(stats), c("summary", "age"))
--- End diff --

Sure. I agree. That would be better.





[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...

2016-07-07 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14096#discussion_r70006591
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1804,11 +1804,11 @@ test_that("describe() and summarize() on a DataFrame", {
   expect_equal(collect(stats)[2, "age"], "24.5")
   expect_equal(collect(stats)[3, "age"], "7.7781745930520225")
   stats <- describe(df)
-  expect_equal(collect(stats)[4, "name"], "Andy")
+  expect_equal(columns(stats), c("summary", "age"))
--- End diff --

Sorry, one last thing - instead of removing the previous test, can we add a new assert?
Also, maybe we can add the failing test case from the JIRA as a new test case?





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61935/
Test PASSed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14096
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61935 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61935/consoleFull)**
 for PR 14096 at commit 
[`97a158e`](https://github.com/apache/spark/commit/97a158e9e91c800b2be7682d6cc77b86d047626c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14097: [MINOR][Streaming][Docs] Minor changes on kinesis integr...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14097
  
**[Test build #61937 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61937/consoleFull)**
 for PR 14097 at commit 
[`0740b2b`](https://github.com/apache/spark/commit/0740b2b053d854f394aa2094a29dd878e4d4bd5d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14097: [MINOR][Streaming][Docs] Minor changes on kinesis integr...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14097
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14097: [MINOR][Streaming][Docs] Minor changes on kinesis integr...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14097
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61937/
Test PASSed.





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005830
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -236,29 +246,43 @@ private[spark] class TaskSchedulerImpl(
* given TaskSetManager have completed, so state associated with the TaskSetManager should be
* cleaned up.
*/
-  def taskSetFinished(manager: TaskSetManager): Unit = synchronized {
+  def taskSetFinished(manager: TaskSetManager, success: Boolean): Unit = synchronized {
 taskSetsByStageIdAndAttempt.get(manager.taskSet.stageId).foreach { taskSetsForStage =>
   taskSetsForStage -= manager.taskSet.stageAttemptId
   if (taskSetsForStage.isEmpty) {
 taskSetsByStageIdAndAttempt -= manager.taskSet.stageId
   }
 }
 manager.parent.removeSchedulable(manager)
-logInfo("Removed TaskSet %s, whose tasks have all completed, from pool %s"
-  .format(manager.taskSet.id, manager.parent.name))
+if (success) {
+  blacklistTracker.map(_.taskSetSucceeded(manager.taskSet.stageId, this))
--- End diff --

`.foreach`. This happens in other places too; I'll refrain from pointing out all of them.
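
A quick illustration of the suggestion, with made-up object and value names rather than the PR's code: calling `map` on an `Option` purely for a side effect allocates an `Option[Unit]` that is immediately discarded, while `foreach` runs the same effect and states the intent.

```scala
// Illustrative sketch: foreach vs. map for side effects on an Option.
object ForeachVsMap extends App {
  val tracker: Option[String] = Some("blacklistTracker")

  tracker.map(t => println(s"$t started"))      // works, but builds a discarded Option[Unit]
  tracker.foreach(t => println(s"$t started"))  // same effect, returns Unit, clearer intent
}
```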





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14098
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14098
  
**[Test build #61939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61939/consoleFull)** for PR 14098 at commit [`94df090`](https://github.com/apache/spark/commit/94df0908d17d28eb6c0223456df7daad24899e47).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14098
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61939/
Test FAILed.





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005741
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -52,13 +51,23 @@ import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
  * acquire a lock on us, so we need to make sure that we don't try to lock the backend while
  * we are holding a lock on ourselves.
  */
-private[spark] class TaskSchedulerImpl(
+private[spark] class TaskSchedulerImpl private[scheduler](
     val sc: SparkContext,
     val maxTaskFailures: Int,
+    private[scheduler] val blacklistTracker: Option[BlacklistTracker],
+    private val clock: Clock = new SystemClock,
     isLocal: Boolean = false)
   extends TaskScheduler with Logging
 {
-  def this(sc: SparkContext) = this(sc, sc.conf.getInt("spark.task.maxFailures", 4))
+  def this(sc: SparkContext) = {
+    this(sc, sc.conf.getInt("spark.task.maxFailures", 4),
+      TaskSchedulerImpl.createBlacklistTracker(sc.conf))
+  }
+
+  def this(sc: SparkContext, maxTaskFailures: Int, isLocal: Boolean) = {
+    this(sc, maxTaskFailures, TaskSchedulerImpl.createBlacklistTracker(sc.conf),
+      clock = new SystemClock, isLocal)
--- End diff --

nit: also use arg name for `isLocal`.
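
A tiny standalone demo of the nit, with an invented method rather than the PR's constructor: naming a trailing boolean argument such as `isLocal` keeps call sites self-documenting once positional and named arguments are mixed.

```scala
// Illustrative sketch: named arguments make boolean parameters readable.
object NamedArgsDemo extends App {
  def schedule(maxTaskFailures: Int, isLocal: Boolean = false): String =
    s"maxTaskFailures=$maxTaskFailures, isLocal=$isLocal"

  println(schedule(4, isLocal = true)) // clear at the call site
  println(schedule(4, true))           // compiles, but `true` is opaque here
}
```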





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005706
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -153,6 +162,7 @@ private[spark] class TaskSchedulerImpl(
 
   override def start() {
     backend.start()
+    blacklistTracker.map(_.start())
--- End diff --

`.foreach`





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14098
  
**[Test build #61939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61939/consoleFull)** for PR 14098 at commit [`94df090`](https://github.com/apache/spark/commit/94df0908d17d28eb6c0223456df7daad24899e47).





[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/14088
  
@tgravescs fixed the description and style.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-07 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/14098
  
@liancheng Can you review it? Thanks!





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005525
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes.  It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, eg.: bad user code, which may lead to many
+ * task failures, but that should not count against individual executors; many small stages, which
+ * may prevent a bad executor for having many failures within one stage, but still many failures
+ * over the entire application; "flaky" executors, that don't fail every task, but are still
+ * faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe.  Though it is
+  * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl.  The
+  * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+    conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor.  Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage.  Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+    new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+    new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+    val now = clock.getTimeMillis()
+    // quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+    if (now > nextExpiryTime) {
+      val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+      if (execsToClear.nonEmpty) {
+        logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+        execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+      }

[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...

2016-07-07 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/14098

[SPARK-16380][SQL][Example]:Update SQL examples and programming guide for 
Python language binding

## What changes were proposed in this pull request?

In the current sql-programming-guide.md, the Python examples are hard-coded in the md file.

This PR adds a separate SparkSQLExample.py, following the same pattern as the ml examples.

The new file includes all of the working, hard-coded examples as a self-contained application, except for the Hive examples and for calls such as spark.refreshTable, which does not exist on SparkSession. We can revisit those examples later and fold them into the self-contained application.

## How was this patch tested?

Manual tests:
./bin/spark-submit examples/src/main/python/SparkSQLExample.py
Also built the docs and checked that the generated document includes the examples correctly, as the ml documents do.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark sql

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14098.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14098









[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61938 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61938/consoleFull)** for PR 14096 at commit [`65f2236`](https://github.com/apache/spark/commit/65f2236970d42ab1ee8115a5f0102119504e633f).





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005272
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
(same `BlacklistTracker.scala` hunk as quoted above)

[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005147
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
(same `BlacklistTracker.scala` hunk as quoted above)

[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70004452
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
(same `BlacklistTracker.scala` hunk as quoted above)

[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14096#discussion_r7000
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1804,11 +1804,11 @@ test_that("describe() and summarize() on a DataFrame", {
   expect_equal(collect(stats)[2, "age"], "24.5")
   expect_equal(collect(stats)[3, "age"], "7.7781745930520225")
   stats <- describe(df)
-  expect_equal(collect(stats)[4, "name"], "Andy")
+  expect_equal(columns(summary(df)), c("summary", "age"))
   expect_equal(collect(stats)[5, "age"], "30")
 
   stats2 <- summary(df)
-  expect_equal(collect(stats2)[4, "name"], "Andy")
+  expect_equal(columns(summary(df)), c("summary", "age"))
--- End diff --

And here, too. I'll fix it soon.





[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14071#discussion_r70004411
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -939,42 +940,33 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
     // to include the partition columns here explicitly
     val schema = cols ++ partitionCols
 
-    // Storage format
-    val defaultStorage: CatalogStorageFormat = {
-      val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile")
-      val defaultHiveSerde = HiveSerDe.sourceToSerDe(defaultStorageType, conf)
-      CatalogStorageFormat(
-        locationUri = None,
-        inputFormat = defaultHiveSerde.flatMap(_.inputFormat)
-          .orElse(Some("org.apache.hadoop.mapred.TextInputFormat")),
-        outputFormat = defaultHiveSerde.flatMap(_.outputFormat)
-          .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
-        // Note: Keep this unspecified because we use the presence of the serde to decide
-        // whether to convert a table created by CTAS to a datasource table.
-        serde = None,
-        compressed = false,
-        serdeProperties = Map())
-    }
     validateRowFormatFileFormat(ctx.rowFormat, ctx.createFileFormat, ctx)
-    val fileStorage = Option(ctx.createFileFormat).map(visitCreateFileFormat)
-      .getOrElse(CatalogStorageFormat.empty)
-    val rowStorage = Option(ctx.rowFormat).map(visitRowFormat)
-      .getOrElse(CatalogStorageFormat.empty)
-    val location = Option(ctx.locationSpec).map(visitLocationSpec)
+    var storage = CatalogStorageFormat(
+      locationUri = Option(ctx.locationSpec).map(visitLocationSpec),
+      provider = Some("hive"),
+      properties = Map.empty)
+    Option(ctx.createFileFormat).foreach(ctx => storage = getFileFormat(ctx, storage))
+    Option(ctx.rowFormat).foreach(ctx => storage = getRowFormat(ctx, storage))
--- End diff --

Hmm, `defaultStorage.serde` is still `None`, so I think this should be fine. Maybe we just need to add a comment explaining how the value of `serde` is determined?
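
To make that concrete, a minimal standalone sketch of the fold-the-optional-clauses pattern under review; `StorageFormat` and the clause values are simplified stand-ins for the real `CatalogStorageFormat` and parser context:

```scala
// Illustrative sketch: start from a default storage format and fold each
// optional clause into it. serde stays None unless a clause supplies one,
// which is what later code keys on for CTAS-to-datasource conversion.
case class StorageFormat(location: Option[String] = None, serde: Option[String] = None)

object FoldClausesSketch extends App {
  val locationClause: Option[String] = Some("/warehouse/t") // a LOCATION clause was given
  val serdeClause: Option[String] = None                    // no ROW FORMAT SERDE clause

  var storage = StorageFormat(location = locationClause)
  serdeClause.foreach(s => storage = storage.copy(serde = Some(s)))

  println(storage) // StorageFormat(Some(/warehouse/t),None)
}
```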




