[GitHub] spark pull request: [SPARK-12659] fix NPE in UnsafeExternalSorter ...

2016-01-05 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/10606#discussion_r48912881
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java ---
@@ -223,14 +227,9 @@ public void loadNext() {
* {@code next()} will return the same mutable object.
*/
   public SortedIterator getSortedIterator() {
-sorter.sort(array, 0, pos / 2, sortComparator);
-return new SortedIterator(pos / 2);
-  }
-
-  /**
-   * Returns an iterator over record pointers in original order (inserted).
-   */
-  public SortedIterator getIterator() {
+if (sortComparator != null) {
+  sorter.sort(array, 0, pos / 2, sortComparator);
--- End diff --

No sorting is needed, only spilling is needed.
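The guard in the diff above (sort only when a comparator is present; a null comparator means the caller only needs insertion-order iteration, e.g. for spilling) can be sketched in Python. This is an illustrative analogue, not the actual Java code; all names here are hypothetical:

```python
class InMemorySorter:
    """Toy analogue of an in-memory sorter whose comparator may be absent."""

    def __init__(self, comparator=None):
        # comparator=None mirrors a null sortComparator: the caller only
        # needs to iterate records in insertion order (e.g. to spill them),
        # so no sort should be attempted.
        self.comparator = comparator
        self.records = []

    def insert(self, record):
        self.records.append(record)

    def sorted_iterator(self):
        # Guarded sort, as in the patched getSortedIterator().
        if self.comparator is not None:
            self.records.sort(key=self.comparator)
        return list(self.records)
```

With a comparator the records come back sorted; without one they come back in insertion order, untouched.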


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12379][ML][MLLIB] Copy GBT implementati...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10607#issuecomment-169171784
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48795/
Test PASSed.




[GitHub] spark pull request: [SPARK-3873] [core] Import ordering fixes.

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10578#issuecomment-169172279
  
Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12630][DOC] Update param descriptions

2016-01-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/10598#discussion_r48913054
  
--- Diff: python/pyspark/mllib/classification.py ---
@@ -94,16 +94,18 @@ class LogisticRegressionModel(LinearClassificationModel):
 Classification model trained using Multinomial/Binary Logistic
 Regression.
 
-:param weights: Weights computed for every feature.
-:param intercept: Intercept computed for this model. (Only used
-in Binary Logistic Regression. In Multinomial Logistic
-Regression, the intercepts will not be a single value,
-so the intercepts will be part of the weights.)
-:param numFeatures: the dimension of the features.
-:param numClasses: the number of possible outcomes for k classes
-classification problem in Multinomial Logistic Regression.
-By default, it is binary logistic regression so numClasses
-will be set to 2.
+:param weights:
+  Weights computed for every feature.
+:param intercept:
+  Intercept computed for this model. (Only used in Binary Logistic
+  Regression. In Multinomial Logistic Regression, the intercepts will not
+  be a single value, so the intercepts will be part of the weights.)
+:param numFeatures:
+  the dimension of the features.
--- End diff --

nit: capitalize the first word in the description sentence




[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...

2016-01-05 Thread vectorijk
Github user vectorijk commented on the pull request:

https://github.com/apache/spark/pull/10527#issuecomment-169172817
  
cc @cloud-fan @marmbrus @davies




[GitHub] spark pull request: [SPARK-12593][SQL][WIP] Converts resolved logi...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10541#issuecomment-168982479
  
**[Test build #48763 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48763/consoleFull)** for PR 10541 at commit [`1e50288`](https://github.com/apache/spark/commit/1e50288d6f956608b53554d31bd394bf919812e0).




[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10498#issuecomment-168982645
  
**[Test build #48765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48765/consoleFull)** for PR 10498 at commit [`3ff968b`](https://github.com/apache/spark/commit/3ff968b29d3852c92952454254ae6e1f7ba6599d).




[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10583#issuecomment-168984129
  
**[Test build #48766 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48766/consoleFull)** for PR 10583 at commit [`e397370`](https://github.com/apache/spark/commit/e39737023920c3916ad8ed6e4d4b46072bfe4f7a).




[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10570#issuecomment-168985619
  
Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10570#issuecomment-168985593
  
**[Test build #48762 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48762/consoleFull)** for PR 10570 at commit [`bfaa1fa`](https://github.com/apache/spark/commit/bfaa1fa79430030d7315cd6530f3da86c0eb39e1).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10570#issuecomment-168985620
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48762/
Test FAILed.




[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10602#issuecomment-168986043
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-05 Thread vijaykiran
Github user vijaykiran commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r48839934
  
--- Diff: python/pyspark/mllib/fpm.py ---
@@ -130,15 +133,22 @@ def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=320
 """
 Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
 
-:param data: The input data set, each element contains a sequnce of itemsets.
-:param minSupport: the minimal support level of the sequential pattern, any pattern appears
-more than  (minSupport * size-of-the-dataset) times will be output (default: `0.1`)
-:param maxPatternLength: the maximal length of the sequential pattern, any pattern appears
-less than maxPatternLength will be output. (default: `10`)
-:param maxLocalProjDBSize: The maximum number of items (including delimiters used in
-the internal storage format) allowed in a projected database before local
-processing. If a projected database exceeds this size, another
-iteration of distributed prefix growth is run. (default: `3200`)
+:param data:
+  The input data set, each element contains a sequnce of itemsets.
+:param minSupport:
+  The minimal support level of the sequential pattern, any pattern appears
+  more than  (minSupport * size-of-the-dataset) times will be output.
+  default: `0.1`)
--- End diff --

I think the format should be (default: `0.1`).
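For illustration, the convention being requested might look like this in a docstring (a hypothetical sketch following the reviewer's note, not the actual patch):

```python
def train(cls, data, minSupport=0.1, maxPatternLength=10):
    """
    Finds the complete set of frequent sequential patterns in the
    input sequences of itemsets.

    :param minSupport:
      The minimal support level of the sequential pattern; any pattern
      that appears more than (minSupport * size-of-the-dataset) times
      will be output.
      (default: `0.1`)
    """
```

The point is that the default appears as a complete parenthesized clause, `(default: \`0.1\`)`, rather than with a dangling close-paren.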




[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-05 Thread vijaykiran
Github user vijaykiran commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r48839858
  
--- Diff: python/pyspark/mllib/fpm.py ---
@@ -68,11 +68,14 @@ def train(cls, data, minSupport=0.3, numPartitions=-1):
 """
 Computes an FP-Growth model that contains frequent itemsets.
 
-:param data: The input data set, each element contains a
-transaction.
-:param minSupport: The minimal support level (default: `0.3`).
-:param numPartitions: The number of partitions used by
-parallel FP-growth (default: same as input data).
+ :param data:
+   The input data set, each element contains a transaction.
+ :param minSupport:
+   The minimal support level
+   (default: `0.3`)
+ :param numPartitions:The number of partitions used by parallel FP-growth
--- End diff --

You missed this one :)




[GitHub] spark pull request: [SPARK-11373] [CORE] Add metrics to the Histor...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9571#issuecomment-168988730
  
**[Test build #48768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48768/consoleFull)** for PR 9571 at commit [`d6fa568`](https://github.com/apache/spark/commit/d6fa568fab72a2c4d57ecfcd304d000379534990).




[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-05 Thread somideshmukh
GitHub user somideshmukh opened a pull request:

https://github.com/apache/spark/pull/10602

[SPARK-12632][Python][Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation]

Made changes in the FPM file; the Recommendation file does not contain param changes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/somideshmukh/spark Branch12632-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10602.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10602


commit 5b53e88794ecb7c9a8a7f8b68aa8a3fb7c3ac7e3
Author: somideshmukh 
Date:   2016-01-05T12:18:51Z

[SPARK-12632][Python][Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation]






[GitHub] spark pull request: [SPARK-11315] [YARN] Add YARN extension servic...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8744#issuecomment-168988728
  
**[Test build #48769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48769/consoleFull)** for PR 8744 at commit [`90a91c9`](https://github.com/apache/spark/commit/90a91c987bbeeb36bd0af36f743871eeb05fa5e4).




[GitHub] spark pull request: [SPARK-1537] [YARN] Add history provider for Y...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10545#issuecomment-168990626
  
**[Test build #48767 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48767/consoleFull)** for PR 10545 at commit [`8d781db`](https://github.com/apache/spark/commit/8d781dbb4871383d43cd4d03776da5c617c6b0da).




[GitHub] spark pull request: [STREAMING][DOCS][EXAMPLES] Minor fixes

2016-01-05 Thread jaceklaskowski
GitHub user jaceklaskowski opened a pull request:

https://github.com/apache/spark/pull/10603

[STREAMING][DOCS][EXAMPLES] Minor fixes



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jaceklaskowski/spark streaming-actor-custom-receiver

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10603


commit 5629feb2f43df706f0664b67c098d11a3c0b7185
Author: Jacek Laskowski 
Date:   2016-01-05T12:55:25Z

[STREAMING][DOCS][EXAMPLES] Minor fixes






[GitHub] spark pull request: [SPARK-11315] [YARN] Add YARN extension servic...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8744#issuecomment-168991865
  
**[Test build #48769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48769/consoleFull)** for PR 8744 at commit [`90a91c9`](https://github.com/apache/spark/commit/90a91c987bbeeb36bd0af36f743871eeb05fa5e4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-11315] [YARN] Add YARN extension servic...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8744#issuecomment-168991988
  
Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12504][SQL] Masking credentials in the ...

2016-01-05 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10452#issuecomment-169192724
  
Thanks, merging to master.




[GitHub] spark pull request: [SPARK-3873] [sql] Import ordering fixes.

2016-01-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10573




[GitHub] spark pull request: [SPARK-12504][SQL] Masking credentials in the ...

2016-01-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10452




[GitHub] spark pull request: [SPARK-12581][SQL] Support case-sensitive tabl...

2016-01-05 Thread maropu
Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/10523#issuecomment-169193276
  
@yhuai Yes, quoted tables in postgres are always case-sensitive. Do we need to support case-insensitive table names? Table names in sparksql (`DataFrame#registerTempTable`) and typical databases such as oracle and mysql are also case-sensitive, so IMO we need to comply with the rule.




[GitHub] spark pull request: [SPARK-11696] [ML, MLlib] Optimization: Extend...

2016-01-05 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9667#issuecomment-169193588
  
I see.  Adding this seems reasonable since some spark.ml algorithms depend 
on these APIs.  However, I want to avoid breaking the public optimization APIs 
in spark.mllib.  (That should also let you make fewer corrections to the test 
suites and callers of the methods.)  I'll make a few suggestions for that.




[GitHub] spark pull request: [SPARK-11696] [ML, MLlib] Optimization: Extend...

2016-01-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9667#discussion_r48920477
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
@@ -133,7 +133,7 @@ class GradientDescent private[spark] (private var gradient: Gradient, private va
   miniBatchFraction,
   initialWeights,
   convergenceTol)
-weights
+(weights, lossHistory.last, iter)
--- End diff --

It'd be nice to return the whole loss history.




[GitHub] spark pull request: [SPARK-11696] [ML, MLlib] Optimization: Extend...

2016-01-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9667#discussion_r48920506
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
@@ -178,7 +178,7 @@ object GradientDescent extends Logging {
   regParam: Double,
   miniBatchFraction: Double,
   initialWeights: Vector,
-  convergenceTol: Double): (Vector, Array[Double]) = {
+  convergenceTol: Double): (Vector, Array[Double], Integer) = {
--- End diff --

Do you have to change the API here?  The loss history should have length = 
num iterations, right?




[GitHub] spark pull request: [SPARK-11696] [ML, MLlib] Optimization: Extend...

2016-01-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9667#discussion_r48920464
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
@@ -122,8 +122,8 @@ class GradientDescent private[spark] (private var gradient: Gradient, private va
* @return solution vector
*/
   @DeveloperApi
-  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
-val (weights, _) = GradientDescent.runMiniBatchSGD(
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): (Vector, Double, Integer) = {
--- End diff --

This API should not be changed.  You could add a new method 
(```optimizeWithStats```?) which returns the 3 values, and then share the 
implementation.
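The backward-compatible shape being suggested here (keep `optimize` stable, add a stats-returning variant that shares the implementation) could look roughly like this. Python is used for brevity; every name and value below is illustrative, not Spark's actual API:

```python
class GradientDescent:
    def _run_sgd(self, data, initial_weights, iters=3):
        # Stand-in for the real mini-batch SGD loop; the loss values are
        # deterministic dummies so the sketch is self-contained.
        weights = list(initial_weights)
        loss_history = [1.0 / (i + 1) for i in range(iters)]
        return weights, loss_history

    def optimize(self, data, initial_weights):
        # Unchanged public API: callers still get only the solution vector.
        weights, _ = self._run_sgd(data, initial_weights)
        return weights

    def optimize_with_stats(self, data, initial_weights):
        # New entry point exposing the final loss and the iteration count
        # (which equals len(loss_history)) without breaking optimize().
        weights, history = self._run_sgd(data, initial_weights)
        return weights, history[-1], len(history)
```

Existing callers of `optimize` are untouched, while algorithms that need convergence statistics call the new method.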




[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

2016-01-05 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10593#discussion_r48920875
  
--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", <function>)
+ *   benchmark.addCase("V2", <function>)
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's being run
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be copied and added as
+   * a comment with the benchmark. Although the results vary from machine to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+require(benchmarks.nonEmpty)
+val results = benchmarks.map { c =>
+  Benchmark.measure(valuesPerIteration, c.fn, iters)
+}
+val firstRate = results.head.avgRate
+// scalastyle:off
+// The results are going to be processor specific so it is useful to 
include that.
+println(Benchmark.getProcessorName())
+printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg 
Rate(M/s)", "Relative Rate")
+
println("---")
+results.zip(benchmarks).foreach { r =>
+  printf("%-30s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, 
"%10.2f" format r._1.avgRate,
+"%6.2f X" format (r._1.avgRate / firstRate))
+}
+println
+// scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return a user helpful processor information. Getting at 
this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 
2.50GHz"
+   */
+  def getProcessorName(): String = {
+if (SystemUtils.IS_OS_MAC_OSX) {
+  Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", 
"machdep.cpu.brand_string"))
+} else if (SystemUtils.IS_OS_LINUX) {
+  Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model 
name\"", "/proc/cpuinfo"))
+} else {
+  System.getenv("PROCESSOR_IDENTIFIER")
+}
+  }
+
+  /**
+   * Runs a single function `f` for iters, returning the average time the 
function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+var totalTime = 0L
+for (i <- 0 until iters + 1) {
+  val start = System.currentTimeMillis()
--- End diff --

How about calling System.nanoTime() for short-running benchmarks instead of 
System.currentTimeMillis()?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...

2016-01-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10583#issuecomment-169196055
  
cc @cloud-fan  can you take a look at this? Thanks.






[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169196126
  
**[Test build #48808 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48808/consoleFull)** for PR 10609 at commit [`0228eef`](https://github.com/apache/spark/commit/0228eef185e379e80cd3622194e785187f673bce).





[GitHub] spark pull request: [SPARK-11531] [ML] : SparseVector error Msg

2016-01-05 Thread rekhajoshm
Github user rekhajoshm commented on the pull request:

https://github.com/apache/spark/pull/9525#issuecomment-169196705
  
Thanks @jkbradley, I might have missed it or thought it was still under discussion. Updated, thanks.





[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...

2016-01-05 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10589#discussion_r48921242
  
--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *   val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", <function>)
+ *   benchmark.addCase("V2", <function>)
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's being run.
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+    benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be copied and added as
+   * a comment with the benchmark. Although the results vary from machine to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+    require(benchmarks.nonEmpty)
+    val results = benchmarks.map { c =>
+      Benchmark.measure(valuesPerIteration, c.fn, iters)
+    }
+    val firstRate = results.head.avgRate
+    // scalastyle:off
+    // The results are going to be processor specific so it is useful to include that.
+    println(Benchmark.getProcessorName())
+    printf("%-24s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
+    println("-")
+    results.zip(benchmarks).foreach { r =>
+      printf("%-24s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, "%10.2f" format r._1.avgRate,
+        "%6.2f X" format (r._1.avgRate / firstRate))
+    }
+    println
+    // scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return user-helpful processor information. Getting at this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz"
+   */
+  def getProcessorName(): String = {
+    if (SystemUtils.IS_OS_MAC_OSX) {
+      Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
+    } else if (SystemUtils.IS_OS_LINUX) {
+      Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model name\"", "/proc/cpuinfo"))
+    } else {
+      System.getenv("PROCESSOR_IDENTIFIER")
+    }
+  }
+
+  /**
+   * Runs a single function `f` for iters, returning the average time the function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+    var totalTime = 0L
+    for (i <- 0 until iters + 1) {
+      val start = System.currentTimeMillis()
--- End diff --

How about calling System.nanoTime() for short-running benchmarks instead of System.currentTimeMillis()?
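A minimal sketch of what that suggestion could look like (a hypothetical variant of the `measure` helper in the diff, not the actual patch):

```scala
object NanoBenchmark {
  // Hypothetical nanoTime-based variant of Benchmark.measure.
  // System.nanoTime() has far finer resolution than System.currentTimeMillis()
  // (~1 ms granularity), so short-running cases are not rounded down to zero.
  // Returns (average milliseconds per iteration, millions of values per second).
  def measure(num: Long, f: Int => Unit, iters: Int): (Double, Double) = {
    var totalNs = 0L
    for (i <- 0 until iters + 1) {
      val start = System.nanoTime()
      f(i)
      val elapsed = System.nanoTime() - start
      if (i > 0) totalNs += elapsed  // treat iteration 0 as warm-up
    }
    val avgMs = totalNs.toDouble / iters / 1e6
    val avgRate = num.toDouble / (totalNs.toDouble / iters / 1e9) / 1e6
    (avgMs, avgRate)
  }
}
```

Note that nanoTime is also monotonic, so the measurement is immune to wall-clock adjustments that can skew currentTimeMillis-based timings.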





[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...

2016-01-05 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10583#discussion_r48921238
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -587,6 +586,13 @@ class Analyzer(
 case other => other
   }
 }
+  case u @ UnresolvedGenerator(name, children) =>
--- End diff --

Do we need to add `UnresolvedGenerator`?
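For readers following the thread, a toy sketch of the parse-then-resolve pattern that a node like `UnresolvedGenerator` suggests (hypothetical classes, not Spark's actual analyzer):

```scala
// Hypothetical sketch: the parser emits an "unresolved" placeholder for a
// generator function, and an analyzer rule later swaps it for the concrete
// expression looked up by name in a function registry.
sealed trait Expr
case class UnresolvedGen(name: String, children: Seq[Expr]) extends Expr
case class Explode(child: Expr) extends Expr
case class Literal(v: Any) extends Expr

val registry: Map[String, Seq[Expr] => Expr] =
  Map("explode" -> (args => Explode(args.head)))

def resolve(e: Expr): Expr = e match {
  case UnresolvedGen(name, children) =>
    registry.get(name.toLowerCase)
      .map(build => build(children))
      .getOrElse(sys.error(s"Undefined function: $name"))
  case other => other  // already-resolved nodes pass through unchanged
}
```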





[GitHub] spark pull request: [SPARK-3873] [core] Import ordering fixes.

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10578#issuecomment-169197671
  
**[Test build #48799 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48799/consoleFull)** for PR 10578 at commit [`c7bee0a`](https://github.com/apache/spark/commit/c7bee0a3ba32adb4c348bbada71d163fc6770384).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3873] [core] Import ordering fixes.

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10578#issuecomment-169198124
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-3873] [core] Import ordering fixes.

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10578#issuecomment-169198126
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48799/
Test PASSed.





[GitHub] spark pull request: [SPARK-12400][Shuffle] Avoid generating temp s...

2016-01-05 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/10376#issuecomment-169198730
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...

2016-01-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10498#discussion_r48921611
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/bucket.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.sources.{OutputWriter, OutputWriterFactory, HadoopFsRelationProvider, HadoopFsRelation}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * A container for bucketing information.
+ * Bucketing is a technique for decomposing data sets into more manageable parts, and the number
+ * of buckets is fixed so it does not fluctuate with data.
+ *
+ * @param numBuckets number of buckets.
+ * @param bucketColumnNames the names of the columns that are used to generate the bucket id.
+ * @param sortColumnNames the names of the columns that are used to sort data in each bucket.
+ */
+private[sql] case class BucketSpec(
+    numBuckets: Int,
+    bucketColumnNames: Seq[String],
+    sortColumnNames: Seq[String])
+
+private[sql] trait BucketedHadoopFsRelationProvider extends HadoopFsRelationProvider {
--- End diff --

Should we expose the bucket API to users so that they can implement data sources that support bucketing?

cc @rxin @nongli 
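For readers less familiar with bucketing, a toy sketch of the idea the `BucketSpec` comment describes (hypothetical names, not the API under discussion):

```scala
// Hypothetical sketch of bucket assignment: a row's bucket id is the hash of
// its bucket-column values modulo numBuckets, so the number of buckets stays
// fixed regardless of how much data is written.
case class SimpleBucketSpec(numBuckets: Int, bucketColumnNames: Seq[String])

def bucketId(spec: SimpleBucketSpec, row: Map[String, Any]): Int = {
  val h = spec.bucketColumnNames.map(col => row(col)).hashCode()
  ((h % spec.numBuckets) + spec.numBuckets) % spec.numBuckets  // force non-negative
}
```

Because only the bucket columns feed the hash, two rows with equal bucket-column values always land in the same bucket, which is what makes bucket-aware joins and aggregations possible.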





[GitHub] spark pull request: [SPARK-12400][Shuffle] Avoid generating temp s...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10376#issuecomment-169200095
  
**[Test build #48809 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48809/consoleFull)** for PR 10376 at commit [`7837b06`](https://github.com/apache/spark/commit/7837b0601299da5ba42d45e5279b9c1449a7d619).




