[jira] [Created] (SPARK-9845) Add built-in UDF

2015-08-11 Thread Alex Liu (JIRA)
Alex Liu created SPARK-9845:
---

 Summary: Add built-in UDF
 Key: SPARK-9845
 URL: https://issues.apache.org/jira/browse/SPARK-9845
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1, 1.3.1
Reporter: Alex Liu


Hive has many built-in functions, as listed at 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Can we add similar functions to Spark SQL?
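
For reference, until such built-ins exist, per-session UDFs can fill part of the gap.
A minimal sketch (it assumes an existing sqlContext and a registered table named
people; the function name initcap_like is only illustrative):

{code}
// Register a session-scoped UDF and call it from SQL (Spark 1.4-style API).
sqlContext.udf.register("initcap_like", (s: String) =>
  if (s == null || s.isEmpty) s else s.head.toUpper + s.tail.toLowerCase)

sqlContext.sql("SELECT initcap_like(name) FROM people").show()
{code}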






[jira] [Created] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map

2015-08-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-9847:


 Summary: ML Params copyValues should copy default values to 
default map, not set map
 Key: SPARK-9847
 URL: https://issues.apache.org/jira/browse/SPARK-9847
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical


Currently, Params.copyValues copies default parameter values to the paramMap of 
the target instance, rather than the defaultParamMap.  It should copy to the 
defaultParamMap because explicitly setting a parameter can change the semantics.

This issue arose in [SPARK-9789], where 2 params threshold and thresholds 
for LogisticRegression can have mutually exclusive values.  If thresholds is 
set, then fit() will copy the default value of threshold as well, easily 
resulting in inconsistent settings for the 2 params.
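
To make the distinction concrete, below is a minimal, self-contained sketch of the
proposed copy semantics (a toy model, not Spark's actual Params implementation):
values the parent holds only as defaults stay defaults on the child, so an explicitly
set param (e.g. thresholds) is never accompanied by a copied "set" default (e.g.
threshold).

{code}
import scala.collection.mutable

// Toy stand-in for ml.param.Params: one map for explicit values, one for defaults.
class ToyParams {
  val setMap     = mutable.Map.empty[String, Any]  // explicitly set values
  val defaultMap = mutable.Map.empty[String, Any]  // default values

  // Proposed behaviour: defaults -> target's default map, explicit values -> target's set map.
  def copyValuesTo(to: ToyParams): ToyParams = {
    to.defaultMap ++= defaultMap
    to.setMap     ++= setMap
    to
  }

  // Lookup order mirrors getOrDefault: an explicit value wins over a default.
  def get(name: String): Option[Any] = setMap.get(name).orElse(defaultMap.get(name))
}
{code}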






[jira] [Updated] (SPARK-9816) Support BinaryType in Concat

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9816:
---
Target Version/s: 1.6.0

 Support BinaryType in Concat
 

 Key: SPARK-9816
 URL: https://issues.apache.org/jira/browse/SPARK-9816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Takeshi Yamamuro

 Support BinaryType in Catalyst's Concat, following Hive's behaviour.
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
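
 A sketch of the intended behaviour once this is supported (illustrative only; it
 assumes a table t with BinaryType columns a and b registered in an existing
 sqlContext): concat over binary inputs should return the concatenated bytes, as in
 Hive, instead of failing analysis.

 {code}
 // Expected to yield BinaryType values: a's bytes followed by b's bytes.
 val concatenated = sqlContext.sql("SELECT concat(a, b) AS ab FROM t")
 concatenated.printSchema()   // expected: ab: binary
 {code}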






[jira] [Resolved] (SPARK-8925) Add @since tags to mllib.util

2015-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8925.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7436
[https://github.com/apache/spark/pull/7436]

 Add @since tags to mllib.util
 -

 Key: SPARK-8925
 URL: https://issues.apache.org/jira/browse/SPARK-8925
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Updated] (SPARK-8925) Add @since tags to mllib.util

2015-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8925:
-
Assignee: Sudhakar Thota

 Add @since tags to mllib.util
 -

 Key: SPARK-8925
 URL: https://issues.apache.org/jira/browse/SPARK-8925
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Sudhakar Thota
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2015-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692293#comment-14692293
 ] 

Yin Huai commented on SPARK-9740:
-

Actually, it seems our old first/last functions do not respect nulls.

 first/last aggregate NULL behavior
 --

 Key: SPARK-9740
 URL: https://issues.apache.org/jira/browse/SPARK-9740
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell
Assignee: Yin Huai

 The FIRST/LAST aggregates implemented as part of the new UDAF interface 
 return the first or last non-null value (if any) found. This is a departure 
 from the behavior of the old FIRST/LAST aggregates and from the 
 FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
 if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
 this behavior for the old UDAF interface.
 Hive makes this behavior configurable by adding a skipNulls flag. I would 
 suggest doing the same and making the default behavior compatible with Hive.
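
 A sketch of the suggested shape (names and syntax are illustrative; the exact API
 was still under discussion here), assuming an existing sqlContext and a table t:

 {code}
 // Hive-compatible default: FIRST may return NULL if NULL happens to be the first value seen.
 val hiveCompatible = sqlContext.sql("SELECT first(col) FROM t")

 // Hypothetical skipNulls flag, mirroring Hive's configurable behaviour.
 val skipNulls = sqlContext.sql("SELECT first(col, true) FROM t")
 {code}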






[jira] [Created] (SPARK-9848) Add @since tag to new public APIs in 1.5

2015-08-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9848:


 Summary: Add @since tag to new public APIs in 1.5
 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng









[jira] [Updated] (SPARK-9848) Add @since tag to new public APIs in 1.5

2015-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9848:
-
Labels: starter  (was: )

 Add @since tag to new public APIs in 1.5
 

 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng
  Labels: starter

 We should get a list of new APIs from SPARK-9660. cc: [~fliang]






[jira] [Updated] (SPARK-9848) Add @since tag to new public APIs in 1.5

2015-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9848:
-
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-7751

 Add @since tag to new public APIs in 1.5
 

 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng
  Labels: starter








[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-08-11 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692303#comment-14692303
 ] 

Xiangrui Meng commented on SPARK-7751:
--

This issue is addressed in SPARK-8967. We tried to use an annotation instead of 
a JavaDoc tag for @since. However, I didn't find a way to make it work.

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 (Do NOT tag private or package private classes or methods.)
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig into the git history to figure out in which Spark version a 
 method was first introduced. Take `NaiveBayes.setModelType` as an example: we 
 can grep for `def setModelType` at different release tags.
 {code}
 meng@xm:~/src/spark
 $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.
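
 For illustration, the resulting tag looks like the ScalaDoc sketch below (the class
 is a made-up stand-in, not the real NaiveBayes; the version string is whatever the
 git checks above reveal as the first release containing the method):

 {code}
 class ExampleEstimator {
   /**
    * Sets the model type.
    * @since 1.4.0
    */
   def setModelType(modelType: String): this.type = { this }
 }
 {code}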






[jira] [Assigned] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8971:
---

Assignee: Seth Hendrickson  (was: Apache Spark)

 Support balanced class labels when splitting train/cross validation sets
 

 Key: SPARK-8971
 URL: https://issues.apache.org/jira/browse/SPARK-8971
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Assignee: Seth Hendrickson

 {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
 Spark classes which partition data into training and evaluation sets for 
 performing hyperparameter selection via cross validation.
 Both methods currently perform the split by randomly sampling the datasets. 
 However, when class probabilities are highly imbalanced (e.g. detection of 
 extremely low-frequency events), random sampling may result in cross 
 validation sets not representative of actual out-of-training performance 
 (e.g. no positive training examples could be included).
 Mainstream R packages like 
 [caret|http://topepo.github.io/caret/splitting.html] already support splitting 
 the data based on the class labels.
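
 One possible building block on the Spark side is stratified sampling via
 DataFrame.stat.sampleBy (available from 1.5). A minimal sketch, not the
 implementation proposed for this ticket; it assumes a DataFrame df with a numeric
 "label" column containing the classes 0.0 and 1.0:

 {code}
 // Keep 80% of each class for training so the class ratio is preserved,
 // then use the remaining rows for evaluation.
 val fractions = Map(0.0 -> 0.8, 1.0 -> 0.8)
 val train = df.stat.sampleBy("label", fractions, seed = 42L)
 val eval  = df.except(train)
 {code}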






[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-11 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692312#comment-14692312
 ] 

Seth Hendrickson commented on SPARK-8971:
-

I went ahead and created the PR for this issue, even though some of the design 
choices still merit discussion. This way, others can at least see the code and 
make comments. I did not mark it as WIP, but I can do that if needed. 

 Support balanced class labels when splitting train/cross validation sets
 

 Key: SPARK-8971
 URL: https://issues.apache.org/jira/browse/SPARK-8971
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Assignee: Seth Hendrickson

 {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
 Spark classes which partition data into training and evaluation sets for 
 performing hyperparameter selection via cross validation.
 Both methods currently perform the split by randomly sampling the datasets. 
 However, when class probabilities are highly imbalanced (e.g. detection of 
 extremely low-frequency events), random sampling may result in cross 
 validation sets not representative of actual out-of-training performance 
 (e.g. no positive training examples could be included).
 Mainstream R packages like 
 [caret|http://topepo.github.io/caret/splitting.html] already support splitting 
 the data based on the class labels.






[jira] [Commented] (SPARK-8967) Implement @since as an annotation

2015-08-11 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692318#comment-14692318
 ] 

Xiangrui Meng commented on SPARK-8967:
--

One example is the `deprecated` annotation in Scala: 
https://github.com/scala/scala/blob/2.10.x/src/library/scala/deprecated.scala. 
However, ScalaDoc may have special handling for this annotation.

 Implement @since as an annotation
 -

 Key: SPARK-8967
 URL: https://issues.apache.org/jira/browse/SPARK-8967
 Project: Spark
  Issue Type: New Feature
  Components: Documentation, Spark Core
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
   Original Estimate: 1h
  Remaining Estimate: 1h

 We use the @since tag in JavaDoc. There is one issue: an overloaded method 
 inherits the doc from its parent if no JavaDoc is provided. However, if we 
 want to add @since, we have to add JavaDoc, and then we need to copy the 
 JavaDoc from the parent, which makes it hard to keep the docs in sync.
 A better solution would be implementing @since as an annotation, which is not 
 part of the JavaDoc.
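
 A sketch of what such an annotation could look like, modelled on scala.deprecated
 as mentioned in the comment above (illustrative only, not necessarily the form
 Spark ends up shipping):

 {code}
 import scala.annotation.StaticAnnotation

 // A source-level annotation carrying the version, independent of the JavaDoc text.
 class Since(version: String) extends StaticAnnotation

 class Example {
   @Since("1.5.0")
   def newMethod(): Unit = ()
 }
 {code}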






[jira] [Created] (SPARK-9846) User guide for Multilayer Perceptron Classifier

2015-08-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9846:


 Summary: User guide for Multilayer Perceptron Classifier
 Key: SPARK-9846
 URL: https://issues.apache.org/jira/browse/SPARK-9846
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Alexander Ulanov









[jira] [Resolved] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9814.

   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 1.5.0

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon
Priority: Minor
 Fix For: 1.5.0


 When data sources (such as Parquet) filter data while reading from HDFS (not 
 in memory), the physical planning phase passes filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though passing it seems 
 possible for data sources such as Parquet and JSON. In more detail, it does 
 not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}}, 
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
 RDD[Row]
 {code}
 even though the binary capability issue is solved 
 (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can receive all the raw expressions from 
 the query planner. However, it is experimental, needs different interfaces, 
 and is unstable (for reasons such as binary capability).
 In general, the problem below can happen.
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
  
 2. 
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slower even though it is functionally almost 
 identical, because of the potentially large network traffic (etc.) caused by 
 unfiltered data from the source RDD.
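
 Conceptually, the missing piece is a translation rule in selectFilters() along the
 lines of the sketch below (illustrative, not the actual DataSourceStrategy code; it
 assumes a source-side sources.EqualNullSafe(attribute, value) filter exists for
 buildScan() to receive):

 {code}
 import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Attribute, Literal}
 import org.apache.spark.sql.sources

 // Map the Catalyst null-safe equality onto a source-level filter so that
 // PrunedFilteredScan / PrunedScan implementations can push it down.
 def translateNullSafeEquality(e: expressions.Expression): Option[sources.Filter] = e match {
   case expressions.EqualNullSafe(a: Attribute, Literal(v, _)) => Some(sources.EqualNullSafe(a.name, v))
   case expressions.EqualNullSafe(Literal(v, _), a: Attribute) => Some(sources.EqualNullSafe(a.name, v))
   case _                                                      => None
 }
 {code}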






[jira] [Resolved] (SPARK-9824) Internal Accumulators will leak WeakReferences

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9824.

   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.5.0

 Internal Accumulators will leak WeakReferences
 --

 Key: SPARK-9824
 URL: https://issues.apache.org/jira/browse/SPARK-9824
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Blocker
 Fix For: 1.5.0


 InternalAccumulator.create doesn't call `registerAccumulatorForCleanup` to 
 register itself with ContextCleaner, so `WeakReference`s for these 
 accumulators in Accumulators.originals won't be removed.
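
 Schematically, the missing step is the registration call named above, performed from
 inside Spark's own code when internal accumulators are created (a sketch only; these
 are internal, non-public APIs and this is not the exact patch):

 {code}
 // Inside Spark (package org.apache.spark): after creating an internal accumulator,
 // hand it to the ContextCleaner so that its WeakReference entry in
 // Accumulators.originals can eventually be removed.
 val acc = sc.accumulator(0L, "internal.metric")           // stand-in for InternalAccumulator.create
 sc.cleaner.foreach(_.registerAccumulatorForCleanup(acc))
 {code}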






[jira] [Updated] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9776:
-
Priority: Major  (was: Blocker)

[~sthota] Blocker is for committers to set. This does not rise to that level at 
this stage, esp. as there is no target version. Doesn't mean it's not important 
but it's just 'normal' now.

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
 error, though the same works with spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 






[jira] [Updated] (SPARK-9789) Reinstate LogisticRegression threshold Param

2015-08-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9789:
-
Shepherd: DB Tsai

 Reinstate LogisticRegression threshold Param
 

 Key: SPARK-9789
 URL: https://issues.apache.org/jira/browse/SPARK-9789
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 From [SPARK-9658]:
 LogisticRegression.threshold was replaced by thresholds, but we could keep 
 threshold for backwards compatibility.  We should add it back, but we 
 should maintain the current semantics whereby thresholds overrides 
 threshold.
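
 A sketch of the compatibility rule described above (illustrative only; not the code
 that was merged, and the binary-case mapping below is an assumption): when
 thresholds is set it wins, and the scalar threshold is derived from it.

 {code}
 def effectiveThreshold(threshold: Double, thresholds: Option[Array[Double]]): Double =
   thresholds match {
     case Some(Array(t0, t1)) => t1 / (t0 + t1)   // assumed mapping for the binary case
     case _                   => threshold        // fall back to the scalar param
   }
 {code}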






[jira] [Resolved] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes

2015-08-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9788.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8077
[https://github.com/apache/spark/pull/8077]

 LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
 -

 Key: SPARK-9788
 URL: https://issues.apache.org/jira/browse/SPARK-9788
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Feynman Liang
 Fix For: 1.5.0


 From [SPARK-9658]:
 1. LDA.docConcentration
 It would be nice to keep the old APIs unchanged.  Proposal:
 * Add "asymmetricDocConcentration" and revert the docConcentration changes.
 * If the (internal) doc concentration vector is a single value, 
 getDocConcentration returns it.  If it is a constant vector, 
 getDocConcentration returns the first item; otherwise it fails.
 2. LDAModel.gammaShape
 This should be given a default value.
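
 A sketch of the accessor behaviour proposed in item 1 (illustrative only, not LDA's
 actual code): the scalar getter succeeds only when the stored concentration is
 effectively symmetric.

 {code}
 import org.apache.spark.mllib.linalg.Vector

 def getDocConcentration(alpha: Vector): Double = {
   val symmetric = alpha.size == 1 || (1 until alpha.size).forall(i => alpha(i) == alpha(0))
   require(symmetric, "docConcentration is asymmetric; use asymmetricDocConcentration instead")
   alpha(0)
 }
 {code}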






[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692299#comment-14692299
 ] 

Joseph K. Bradley commented on SPARK-7454:
--

If you won't have time, please say so, so that someone else can take over.  Thanks!

 Perf test for power iteration clustering (PIC)
 --

 Key: SPARK-7454
 URL: https://issues.apache.org/jira/browse/SPARK-7454
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Stephen Boesch








[jira] [Updated] (SPARK-9848) Add @since tag to new public APIs in 1.5

2015-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9848:
-
Description: We should get a list of new APIs from SPARK-9660. cc: [~fliang]

 Add @since tag to new public APIs in 1.5
 

 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng
  Labels: starter

 We should get a list of new APIs from SPARK-9660. cc: [~fliang]






[jira] [Assigned] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8971:
---

Assignee: Apache Spark  (was: Seth Hendrickson)

 Support balanced class labels when splitting train/cross validation sets
 

 Key: SPARK-8971
 URL: https://issues.apache.org/jira/browse/SPARK-8971
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Assignee: Apache Spark

 {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
 Spark classes which partition data into training and evaluation sets for 
 performing hyperparameter selection via cross validation.
 Both methods currently perform the split by randomly sampling the datasets. 
 However, when class probabilities are highly imbalanced (e.g. detection of 
 extremely low-frequency events), random sampling may result in cross 
 validation sets not representative of actual out-of-training performance 
 (e.g. no positive training examples could be included).
 Mainstream R packages like 
 [caret|http://topepo.github.io/caret/splitting.html] already support splitting 
 the data based on the class labels.






[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692309#comment-14692309
 ] 

Apache Spark commented on SPARK-8971:
-

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8112

 Support balanced class labels when splitting train/cross validation sets
 

 Key: SPARK-8971
 URL: https://issues.apache.org/jira/browse/SPARK-8971
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Assignee: Seth Hendrickson

 {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
 Spark classes which partition data into training and evaluation sets for 
 performing hyperparameter selection via cross validation.
 Both methods currently perform the split by randomly sampling the datasets. 
 However, when class probabilities are highly imbalanced (e.g. detection of 
 extremely low-frequency events), random sampling may result in cross 
 validation sets not representative of actual out-of-training performance 
 (e.g. no positive training examples could be included).
 Mainstream R packages like 
 [caret|http://topepo.github.io/caret/splitting.html] already support splitting 
 the data based on the class labels.






[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-08-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692316#comment-14692316
 ] 

Joseph K. Bradley commented on SPARK-7751:
--

OK, I guess we just need to be more careful about the PRs adding @since tags.

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 (Do NOT tag private or package private classes or methods.)
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig into the git history to figure out in which Spark version a 
 method was first introduced. Take `NaiveBayes.setModelType` as an example: we 
 can grep for `def setModelType` at different release tags.
 {code}
 meng@xm:~/src/spark
 $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.






[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Assignee: Yin Huai

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Yin Huai
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.






[jira] [Commented] (SPARK-9427) Add expression functions in SparkR

2015-08-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692600#comment-14692600
 ] 

Yu Ishikawa commented on SPARK-9427:


[~shivaram] After all, I'd like to split this issue into a few sub-issues, since 
it is quite difficult to add all the listed expressions at once and a single PR 
for this issue would be hard to review. I think we could classify them into at 
least three types in SparkR (the Scala-side signatures are sketched below). What 
do you think?

1. Add expressions whose parameters are only {{(Column)}} or {{(Column, 
Column)}}, like {{md5(e: Column)}}
2. Add expressions whose parameters are a little more complicated, like 
{{conv(num: Column, fromBase: Int, toBase: Int)}}
3. Add expressions which conflict with an already existing generic, like 
{{coalesce(e: Column*)}}

{{1}} is not a difficult task; it is mostly extracting method definitions from 
Scala code, and I think we rarely need to consider conflicts with the current 
SparkR code. However, {{2}} and {{3}} are a little harder because of the 
complexity. For example, in {{3}}, if we must modify an existing R generic 
because of the new expressions, we should check whether the modification 
affects the existing code or not.
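
A sketch of the three categories as they appear on the Scala side in
org.apache.spark.sql.functions (the signatures are the ones quoted above; the column
{{c}} is only illustrative):

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val c = new Column("x")
md5(c)                 // 1. Column-only parameters
conv(c, 10, 16)        // 2. extra non-Column parameters
coalesce(c, lit(0))    // 3. varargs, clashing with an existing R generic (coalesce)
{code}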

 Add expression functions in SparkR
 --

 Key: SPARK-9427
 URL: https://issues.apache.org/jira/browse/SPARK-9427
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Yu Ishikawa

 The list of functions to add is based on the SQL functions, and it would be 
 better to add them in a one-shot PR.
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala






[jira] [Updated] (SPARK-9407) Parquet shouldn't fail when pushing down predicates over a column whose underlying Parquet type is an ENUM

2015-08-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9407:
--
Summary: Parquet shouldn't fail when pushing down predicates over a column 
whose underlying Parquet type is an ENUM  (was: Parquet shouldn't push down 
predicates over a column whose underlying Parquet type is an ENUM)

 Parquet shouldn't fail when pushing down predicates over a column whose 
 underlying Parquet type is an ENUM
 --

 Key: SPARK-9407
 URL: https://issues.apache.org/jira/browse/SPARK-9407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 Spark SQL doesn't have an equivalent data type to Parquet {{BINARY (ENUM)}}, 
 and always treats it as a UTF-8 encoded {{StringType}}. Thus, a predicate over 
 a Parquet {{ENUM}} column may be pushed down. However, Parquet 1.7.0 and 
 prior versions only support filter push-down optimization for [a limited set 
 of data 
 types|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ValidTypeMap.java#L66-L80],
 and fail the query.
 The simplest solution seems to be upgrading parquet-mr to 1.8.1, which fixes 
 this issue via PARQUET-201.
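
 For illustration, the failure mode looks roughly like the sketch below (the path and
 column name are assumptions; the file's kind column is BINARY (ENUM) on disk while
 Spark sees it as a string):

 {code}
 // With filter push-down enabled, the string predicate is pushed to Parquet as a
 // binary-column filter, which Parquet 1.7.0 rejects for ENUM-typed columns.
 val df = sqlContext.read.parquet("/path/to/events.parquet")
 df.filter(df("kind") === "ACTIVE").count()
 {code}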






[jira] [Resolved] (SPARK-7165) Sort Merge Join for outer joins

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7165.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Sort Merge Join for outer joins
 ---

 Key: SPARK-7165
 URL: https://issues.apache.org/jira/browse/SPARK-7165
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Adrian Wang
Assignee: Josh Rosen
Priority: Blocker
 Fix For: 1.5.0









[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9730:
---
Parent Issue: SPARK-9697  (was: SPARK-7165)

 Sort Merge Join for Full Outer Join
 ---

 Key: SPARK-9730
 URL: https://issues.apache.org/jira/browse/SPARK-9730
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen








[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9730:
---
Target Version/s: 1.6.0  (was: 1.5.0)

 Sort Merge Join for Full Outer Join
 ---

 Key: SPARK-9730
 URL: https://issues.apache.org/jira/browse/SPARK-9730
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen








[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9730:
---
Assignee: (was: Josh Rosen)

 Sort Merge Join for Full Outer Join
 ---

 Key: SPARK-9730
 URL: https://issues.apache.org/jira/browse/SPARK-9730
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Josh Rosen








[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9730:
---
Target Version/s: 1.5.0

 Sort Merge Join for Full Outer Join
 ---

 Key: SPARK-9730
 URL: https://issues.apache.org/jira/browse/SPARK-9730
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Josh Rosen








[jira] [Issue Comment Deleted] (SPARK-9829) peakExecutionMemory is not correct

2015-08-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-9829:

Comment: was deleted

(was: How many tasks? peakExecutionMemory in Web UI is the sum of 
peakExecutionMemory in all tasks. This value may be confusing sometimes. E.g., 
assume we have 2 tasks, at 10:00am, task 1's memory usage is 10G, which is 
peak, and it finishes at 10:02am; then task 2 starts at 10:03am, and it reaches 
the peak at 10:04am, which is 10G. Then peakExecutionMemory in Web UI will be 
20G, although we have never used more than 10G.

BTW, did you modify the codes? These values should not be shown directly in Web 
UI.

/cc [~andrewor14])

 peakExecutionMemory is not correct
 --

 Key: SPARK-9829
 URL: https://issues.apache.org/jira/browse/SPARK-9829
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Shixiong Zhu

 When running a query with 8G memory, the peakExecutionMemory in the Web UI 
 reported 40344371200 (40G).
 Also, there are lots of accumulators with the same name, so it is hard to know 
 what they mean:
 {code}
 Accumulable   Value
 number of output rows 439614
 number of output rows 7711
 number of output rows 965
 number of rows7829
 number of rows7711
 number of input rows  965
 number of rows52
 number of input rows  439614
 number of output rows 30
 number of input rows  7726
 number of rows277000
 peakExecutionMemory   40344371200
 number of rows7829
 number of rows965
 number of rows7726
 number of rows30
 number of rows138000
 number of rows8028
 number of rows439614
 number of input rows  30
 {code}
 How to reproduce:
 run TPCDS q19 with scale=5, then check out the Web UI






[jira] [Updated] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9849:
---
Description: 
DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
explicitly set the class as a config option, so we must be able to resolve the 
old committer qualified name.



  was:
DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
explicitly set the class as a config option, so we must be able to resolve the 
old committer path.




 DirectParquetOutputCommitter qualified name should be backward compatible
 -

 Key: SPARK-9849
 URL: https://issues.apache.org/jira/browse/SPARK-9849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
 explicitly set the class as a config option, so we must be able to resolve 
 the old committer qualified name.
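
 Conceptually, the backward-compatibility shim remaps the old qualified name before
 loading the class named in the committer config (a sketch; the old/new package names
 and the helper are illustrative, not the merged patch):

 {code}
 def resolveCommitterClass(configuredName: String): Class[_] = {
   val remapped = configuredName match {
     case "org.apache.spark.sql.parquet.DirectParquetOutputCommitter" =>
       "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter"
     case other => other
   }
   Class.forName(remapped)
 }
 {code}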






[jira] [Updated] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9849:
---
Summary: DirectParquetOutputCommitter qualified name should be backward 
compatible  (was: DirectParquetOutputCommitter path should be backward 
compatible)

 DirectParquetOutputCommitter qualified name should be backward compatible
 -

 Key: SPARK-9849
 URL: https://issues.apache.org/jira/browse/SPARK-9849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
 explicitly set the class, so we must be able to resolve the old committer 
 path.






[jira] [Updated] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9849:
---
Description: 
DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
explicitly set the class as a config option, so we must be able to resolve the 
old committer path.



  was:
DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
explicitly set the class, so we must be able to resolve the old committer path.




 DirectParquetOutputCommitter qualified name should be backward compatible
 -

 Key: SPARK-9849
 URL: https://issues.apache.org/jira/browse/SPARK-9849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
 explicitly set the class as a config option, so we must be able to resolve 
 the old committer path.






[jira] [Assigned] (SPARK-9740) first/last aggregate NULL behavior

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9740:
---

Assignee: Yin Huai  (was: Apache Spark)

 first/last aggregate NULL behavior
 --

 Key: SPARK-9740
 URL: https://issues.apache.org/jira/browse/SPARK-9740
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell
Assignee: Yin Huai

 The FIRST/LAST aggregates implemented as part of the new UDAF interface 
 return the first or last non-null value (if any) found. This is a departure 
 from the behavior of the old FIRST/LAST aggregates and from the 
 FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
 if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
 this behavior for the old UDAF interface.
 Hive makes this behavior configurable by adding a skipNulls flag. I would 
 suggest doing the same and making the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692392#comment-14692392
 ] 

Apache Spark commented on SPARK-9740:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8113

 first/last aggregate NULL behavior
 --

 Key: SPARK-9740
 URL: https://issues.apache.org/jira/browse/SPARK-9740
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell
Assignee: Yin Huai

 The FIRST/LAST aggregates implemented as part of the new UDAF interface 
 return the first or last non-null value (if any) found. This is a departure 
 from the behavior of the old FIRST/LAST aggregates and from the 
 FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
 if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
 this behavior for the old UDAF interface.
 Hive makes this behavior configurable by adding a skipNulls flag. I would 
 suggest doing the same and making the default behavior compatible with Hive.






[jira] [Assigned] (SPARK-9740) first/last aggregate NULL behavior

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9740:
---

Assignee: Apache Spark  (was: Yin Huai)

 first/last aggregate NULL behavior
 --

 Key: SPARK-9740
 URL: https://issues.apache.org/jira/browse/SPARK-9740
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell
Assignee: Apache Spark

 The FIRST/LAST aggregates implemented as part of the new UDAF interface 
 return the first or last non-null value (if any) found. This is a departure 
 from the behavior of the old FIRST/LAST aggregates and from the 
 FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
 if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
 this behavior for the old UDAF interface.
 Hive makes this behavior configurable by adding a skipNulls flag. I would 
 suggest doing the same and making the default behavior compatible with Hive.






[jira] [Assigned] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9849:
---

Assignee: Reynold Xin  (was: Apache Spark)

 DirectParquetOutputCommitter qualified name should be backward compatible
 -

 Key: SPARK-9849
 URL: https://issues.apache.org/jira/browse/SPARK-9849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
 explicitly set the class as a config option, so we must be able to resolve 
 the old committer qualified name.






[jira] [Assigned] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9849:
---

Assignee: Apache Spark  (was: Reynold Xin)

 DirectParquetOutputCommitter qualified name should be backward compatible
 -

 Key: SPARK-9849
 URL: https://issues.apache.org/jira/browse/SPARK-9849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Blocker

 DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
 explicitly set the class as a config option, so we must be able to resolve 
 the old committer qualified name.






[jira] [Commented] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692402#comment-14692402
 ] 

Apache Spark commented on SPARK-9849:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8114

 DirectParquetOutputCommitter qualified name should be backward compatible
 -

 Key: SPARK-9849
 URL: https://issues.apache.org/jira/browse/SPARK-9849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 DirectParquetOutputCommitter was moved in SPARK-9763. However, users can 
 explicitly set the class as a config option, so we must be able to resolve 
 the old committer qualified name.






[jira] [Assigned] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9847:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

 ML Params copyValues should copy default values to default map, not set map
 ---

 Key: SPARK-9847
 URL: https://issues.apache.org/jira/browse/SPARK-9847
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Critical

 Currently, Params.copyValues copies default parameter values to the paramMap 
 of the target instance, rather than the defaultParamMap.  It should copy to 
 the defaultParamMap because explicitly setting a parameter can change the 
 semantics.
 This issue arose in [SPARK-9789], where 2 params threshold and thresholds 
 for LogisticRegression can have mutually exclusive values.  If thresholds is 
 set, then fit() will copy the default value of threshold as well, easily 
 resulting in inconsistent settings for the 2 params.






[jira] [Assigned] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9847:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

 ML Params copyValues should copy default values to default map, not set map
 ---

 Key: SPARK-9847
 URL: https://issues.apache.org/jira/browse/SPARK-9847
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 Currently, Params.copyValues copies default parameter values to the paramMap 
 of the target instance, rather than the defaultParamMap.  It should copy to 
 the defaultParamMap because explicitly setting a parameter can change the 
 semantics.
 This issue arose in [SPARK-9789], where 2 params threshold and thresholds 
 for LogisticRegression can have mutually exclusive values.  If thresholds is 
 set, then fit() will copy the default value of threshold as well, easily 
 resulting in inconsistent settings for the 2 params.






[jira] [Commented] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692422#comment-14692422
 ] 

Apache Spark commented on SPARK-9847:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/8115

 ML Params copyValues should copy default values to default map, not set map
 ---

 Key: SPARK-9847
 URL: https://issues.apache.org/jira/browse/SPARK-9847
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 Currently, Params.copyValues copies default parameter values to the paramMap 
 of the target instance, rather than the defaultParamMap.  It should copy to 
 the defaultParamMap because explicitly setting a parameter can change the 
 semantics.
 This issue arose in [SPARK-9789], where 2 params threshold and thresholds 
 for LogisticRegression can have mutually exclusive values.  If thresholds is 
 set, then fit() will copy the default value of threshold as well, easily 
 resulting in inconsistent settings for the 2 params.






[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-11 Thread Stephen Boesch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692437#comment-14692437
 ] 

Stephen Boesch commented on SPARK-7454:
---

Hi, I had intended to clean this up in the past few days but yes, I am
overwhelmed by other tasks. I abdicate.




 Perf test for power iteration clustering (PIC)
 --

 Key: SPARK-7454
 URL: https://issues.apache.org/jira/browse/SPARK-7454
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Stephen Boesch








[jira] [Commented] (SPARK-9827) Too many open files in TungstenExchange

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692441#comment-14692441
 ] 

Apache Spark commented on SPARK-9827:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8116

 Too many open files in TungstenExchange
 ---

 Key: SPARK-9827
 URL: https://issues.apache.org/jira/browse/SPARK-9827
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Josh Rosen
Priority: Blocker

 When running q19 on the TPCDS (scale=5) dataset with 8G memory, it opens 10k 
 shuffle files, crashing many things (even Chrome).
 {code}
 davies@localhost:~/work/spark$ jps
 95385 Jps
 95316 SparkSubmit
 davies@localhost:~/work/spark$ lsof -p 95316 | wc -l
 9827
 davies@localhost:~/work/spark$ lsof -p 95316 | tail
 java  95316 davies 9772r  REG  1,2  9522 97350739 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/2a/shuffle_0_112_0.data
 java  95316 davies 9773r  REG  1,2  8449 97351388 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/1a/shuffle_0_116_0.data
 java  95316 davies 9774r  REG  1,2  8200 97351134 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/09/shuffle_0_113_0.data
 java  95316 davies 9775r  REG  1,2  8057 97351941 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/05/shuffle_0_117_0.data
 java  95316 davies 9776r  REG  1,2  8565 97351133 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/18/shuffle_0_114_0.data
 java  95316 davies 9777r  REG  1,2  8185 97351942 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/1c/shuffle_0_118_0.data
 java  95316 davies 9778r  REG  1,2  8865 97351135 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/07/shuffle_0_115_0.data
 java  95316 davies 9779r  REG  1,2  8255 97351987 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/3d/shuffle_0_119_0.data
 java  95316 davies 9780r  REG  1,2  8449 97351388 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/1a/shuffle_0_116_0.data
 java  95316 davies 9781r  REG  1,2  9105 97352148 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/13/shuffle_0_120_0.data
 davies@localhost:~/work/spark$ ls -l 
 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-71afa3af-f2a5-4b72-8b2d-45aa70ff7466//3a/
 total 68
 -rw-r--r-- 1 davies staff 8272 Aug 11 09:57 shuffle_0_105_0.data
 -rw-r--r-- 1 davies staff 1608 Aug 11 09:57 shuffle_0_109_0.index
 -rw-r--r-- 1 davies staff 8414 Aug 11 09:57 shuffle_0_127_0.data
 -rw-r--r-- 1 davies staff 8368 Aug 11 09:57 shuffle_0_149_0.data
 -rw-r--r-- 1 davies staff 1608 Aug 11 09:57 shuffle_0_40_0.index
 -rw-r--r-- 1 davies staff 1608 Aug 11 09:57 shuffle_0_62_0.index
 -rw-r--r-- 1 davies staff 7965 Aug 11 09:57 shuffle_0_6_0.data
 -rw-r--r-- 1 davies staff 8419 Aug 11 09:57 shuffle_0_80_0.data
 {code}
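A small diagnostic sketch (not from the report; the path argument is whichever blockmgr directory is in use) for counting how many shuffle files a running query keeps on disk:
{code}
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object CountShuffleFiles {
  def main(args: Array[String]): Unit = {
    // First argument: a block manager directory such as .../T/blockmgr-...
    val dir = Paths.get(args.headOption.getOrElse("."))
    val stream = Files.walk(dir)
    try {
      val n = stream.iterator().asScala
        .count(p => p.getFileName.toString.startsWith("shuffle_"))
      println(s"shuffle data/index files under $dir: $n")
    } finally stream.close()
  }
}
{code}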



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated

2015-08-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9640.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Do not run Python Kinesis tests when the Kinesis assembly JAR has not been 
 generated
 

 Key: SPARK-9640
 URL: https://issues.apache.org/jira/browse/SPARK-9640
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS

2015-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681331#comment-14681331
 ] 

Cheng Lian edited comment on SPARK-8824 at 8/11/15 6:55 AM:


Oh sorry, I meant to say {{TIMESTAMP_MICROS}} and I mistook your request for 
{{TIMESTAMP_MICROS}}. I'm afraid it's already too late for 1.5. Another thing 
is that, Spark SQL 1.5 now only has microsecond precision, so even if we 
support {{TIMESTAMP_MILLIS}} in 1.6, we'll probably only read Parquet 
{{TIMESTAMP_MILLIS}} values and convert them to microsecond timestamps.


was (Author: lian cheng):
Oh sorry, I mistook your request for {{TIMESTAMP_MICROS}}. I'm afraid it's 
already too late for 1.5. Another thing is that, Spark SQL 1.5 now only has 
microsecond precision, so even if we support {{TIMESTAMP_MILLIS}} in 1.6, we'll 
probably only read Parquet {{TIMESTAMP_MILLIS}} values and convert them to 
microsecond timestamps.

 Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
 ---

 Key: SPARK-8824
 URL: https://issues.apache.org/jira/browse/SPARK-8824
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS

2015-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681331#comment-14681331
 ] 

Cheng Lian commented on SPARK-8824:
---

Oh sorry, I mistook your request for {{TIMESTAMP_MICROS}}. I'm afraid it's 
already too late for 1.5. Another thing is that, Spark SQL 1.5 now only has 
microsecond precision, so even if we support {{TIMESTAMP_MILLIS}} in 1.6, we'll 
probably only read Parquet {{TIMESTAMP_MILLIS}} values and convert them to 
microsecond timestamps.
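A minimal sketch of the conversion being described, assuming the usual microsecond-based internal representation:
{code}
// Widening a Parquet TIMESTAMP_MILLIS value to Spark SQL's microsecond timestamp.
def millisToMicros(millis: Long): Long = millis * 1000L

// e.g. millisToMicros(1439280000123L) == 1439280000123000L
{code}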

 Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
 ---

 Key: SPARK-8824
 URL: https://issues.apache.org/jira/browse/SPARK-8824
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9802) spark configuration page should mention spark.executor.cores yarn property

2015-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9802.
--
Resolution: Not A Problem

It's documented already, but in the latest docs:

https://spark.apache.org/docs/latest/configuration.html

search for 'spark.executor.cores'. It looks like this got addressed along with 
https://github.com/apache/spark/commit/8f8dc45f6d4c8d7b740eaa3d2ea09d0b531af9dd

 spark configuration page should mention spark.executor.cores yarn property 
 ---

 Key: SPARK-9802
 URL: https://issues.apache.org/jira/browse/SPARK-9802
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
Reporter: nirav patel

 Hi,
 I see that there's an --executor-cores argument available for the spark-submit 
 script, which internally sets spark.executor.cores. However, that property 
 should also be listed on the configuration page so that people who don't use the 
 spark-submit script know how to set the number of cores per executor (container).
 https://spark.apache.org/docs/1.3.1/configuration.html
 Thanks
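For readers who do not go through spark-submit, a minimal sketch of setting the property programmatically (the instance count below is just an example value):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("executor-cores-example")
  .set("spark.executor.cores", "4")      // cores per executor (container)
  .set("spark.executor.instances", "10") // example value only

val sc = new SparkContext(conf)
{code}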



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9727) Make the Kinesis project SBT name and consistent with other streaming projects

2015-08-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9727.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Make the Kinesis project SBT name and consistent with other streaming projects
 --

 Key: SPARK-9727
 URL: https://issues.apache.org/jira/browse/SPARK-9727
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
 Fix For: 1.5.0


 pom.xml - SBT project name: kinesis-asl ---> streaming-kinesis-asl
 SparkBuild - project name: sparkKinesisAsl ---> streamingKinesisAsl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9818) Revert 6136, use docker to test JDBC datasources

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9818:
---

Assignee: Apache Spark

 Revert 6136, use docker to test JDBC datasources
 

 Key: SPARK-9818
 URL: https://issues.apache.org/jira/browse/SPARK-9818
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9814:
---

Assignee: (was: Apache Spark)

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.
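A hedged sketch of the null-safe semantics involved, using hypothetical filter types rather than the Spark sources API:
{code}
sealed trait SourceFilter
case class EqualTo(attribute: String, value: Any) extends SourceFilter
case class EqualNullSafe(attribute: String, value: Any) extends SourceFilter

// How a data source could evaluate pushed-down filters against a row.
def keepsRow(f: SourceFilter, row: Map[String, Any]): Boolean = f match {
  case EqualTo(attr, v) =>
    val cell = row.get(attr).orNull
    cell != null && v != null && cell == v                      // plain equality: NULLs never match
  case EqualNullSafe(attr, v) =>
    val cell = row.get(attr).orNull
    if (cell == null || v == null) cell == null && v == null    // NULL <=> NULL is true
    else cell == v
}
{code}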



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9814:

Comment: was deleted

(was: I just made it. https://github.com/apache/spark/pull/8096)

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681294#comment-14681294
 ] 

Hyukjin Kwon commented on SPARK-9814:
-

I just made it :)

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8757) Check missing and add user guide for MLlib Python API

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-8757:
---
Description: 
Some MLlib algorithms are missing a user guide for Python; we need to check and 
add them.
The algorithms that are missing Python user guides are listed below. Please 
add to this list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide)
*** DCT (user guide)
*** MinMaxScaler (user guide)
*** StopWordsRemover (user guide)
*** VectorSlicer (user guide)
*** ElementwiseProduct (python example)

  was:
Some MLlib algorithms are missing a user guide for Python; we need to check and 
add them.
The algorithms that are missing Python user guides are listed below. Please 
add to this list if you find more.
* For MLlib
** Isotonic regression
** LDA
** Streaming k-means
** PCA
** SVD
** FP-growth
* For ML
** feature
*** CountVectorizerModel
*** DCT
*** MinMaxScaler
*** StopWordsRemover
*** VectorSlicer
*** ElementwiseProduct


 Check missing and add user guide for MLlib Python API
 -

 Key: SPARK-8757
 URL: https://issues.apache.org/jira/browse/SPARK-8757
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Yanbo Liang

 Some MLlib algorithms are missing a user guide for Python; we need to check and 
 add them.
 The algorithms that are missing Python user guides are listed below. Please 
 add to this list if you find more.
 * For MLlib
 ** Isotonic regression (Python example)
 ** LDA (Python example)
 ** Streaming k-means (Java/Python examples)
 ** PCA (Python example)
 ** SVD (Python example)
 ** FP-growth (Python example)
 * For ML
 ** feature
 *** CountVectorizerModel (user guide)
 *** DCT (user guide)
 *** MinMaxScaler (user guide)
 *** StopWordsRemover (user guide)
 *** VectorSlicer (user guide)
 *** ElementwiseProduct (python example)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9818) Revert 6136, use docker to test JDBC datasources

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681560#comment-14681560
 ] 

Apache Spark commented on SPARK-9818:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8101

 Revert 6136, use docker to test JDBC datasources
 

 Key: SPARK-9818
 URL: https://issues.apache.org/jira/browse/SPARK-9818
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681561#comment-14681561
 ] 

Apache Spark commented on SPARK-6136:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8101

 Docker client library introduces Guava 17.0, which causes runtime binary 
 incompatibilities
 --

 Key: SPARK-6136
 URL: https://issues.apache.org/jira/browse/SPARK-6136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.3.0


 Integration test suites in the JDBC data source ({{MySQLIntegration}} and 
 {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively 
 depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary 
 incompatibility issues when Spark is compiled against Hadoop 2.4.
 {code}
 $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 
 -Dhadoop.version=2.4.1
 ...
  > sql/test-only *.ParquetDataSourceOffIOSuite
 ...
 [info] ParquetDataSourceOffIOSuite:
 [info] Exception encountered when attempting to run a suite with class name: 
 org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 
 milliseconds)
 [info]   java.lang.IllegalAccessError: tried to access method 
 com.google.common.base.Stopwatch.<init>()V from class 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 [info]   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261)
 [info]   at 
 parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277)
 [info]   at 
 org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437)
 [info]   at 
 org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 [info]   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 [info]   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525)
 [info]   at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
 [info]   at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
 [info]   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797)
 [info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
 [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76)
 

[jira] [Assigned] (SPARK-9818) Revert 6136, use docker to test JDBC datasources

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9818:
---

Assignee: (was: Apache Spark)

 Revert 6136, use docker to test JDBC datasources
 

 Key: SPARK-9818
 URL: https://issues.apache.org/jira/browse/SPARK-9818
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9814:
---

Assignee: Apache Spark

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Assignee: Apache Spark
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681293#comment-14681293
 ] 

Hyukjin Kwon commented on SPARK-9814:
-

I just made it. https://github.com/apache/spark/pull/8096

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681493#comment-14681493
 ] 

Yanbo Liang commented on SPARK-9663:


[~josephkb] I have finished the check, linked the existing JIRAs here, and closed 
the duplicated ones. Thanks!

 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for PySpark(ML):
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing User Guide documents for PySpark SPARK-8757



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9818) Revert 6136, use docker to test JDBC datasources

2015-08-11 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-9818:
--
Description: (was: https://issues.apache.org/jira/browse/SPARK-6136)

 Revert 6136, use docker to test JDBC datasources
 

 Key: SPARK-9818
 URL: https://issues.apache.org/jira/browse/SPARK-9818
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9818) Revert 6136, use docker to test JDBC datasources

2015-08-11 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-9818:
--
External issue ID:   (was: 6136)

 Revert 6136, use docker to test JDBC datasources
 

 Key: SPARK-9818
 URL: https://issues.apache.org/jira/browse/SPARK-9818
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen

 https://issues.apache.org/jira/browse/SPARK-6136



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9818) Revert 6136, use docker to test JDBC datasources

2015-08-11 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-9818:
--
External issue ID: 6136

 Revert 6136, use docker to test JDBC datasources
 

 Key: SPARK-9818
 URL: https://issues.apache.org/jira/browse/SPARK-9818
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen

 https://issues.apache.org/jira/browse/SPARK-6136



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9810) Remove individual commit messages from the squash commit message

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9810.

  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0  (was: 1.6.0)

 Remove individual commit messages from the squash commit message
 

 Key: SPARK-9810
 URL: https://issues.apache.org/jira/browse/SPARK-9810
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 I took a look at the commit messages in git log -- it looks like the 
 individual commit messages are not that useful to include, but do make the 
 commit messages more verbose. They are usually just a bunch of extremely 
 concise descriptions of bug fixes, merges, etc:
 {code}
 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client
 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
 {code}
 See mailing list discussions: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-Removing-individual-commit-messages-from-the-squash-commit-message-td13295.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests

2015-08-11 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-9817:
--

 Summary: Improve the container placement strategy by considering 
the localities of pending container requests
 Key: SPARK-9817
 URL: https://issues.apache.org/jira/browse/SPARK-9817
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Saisai Shao
Priority: Minor


The current implementation does not consider the localities of pending container 
requests, since the required locality preferences of tasks shift from time to 
time. It is better to discard outdated container requests and recalculate them 
with the container placement strategy.
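A hypothetical sketch of the idea (not the YARN allocator API): drop pending requests whose locality preference no longer matches the current task preferences, and let the placement strategy recompute replacements.
{code}
case class ContainerRequest(preferredHosts: Set[String])

def refreshRequests(
    pending: Seq[ContainerRequest],
    currentPreferences: Map[String, Int]): Seq[ContainerRequest] = {
  // Keep requests that still target a host some pending task prefers; the rest
  // are outdated and would be re-derived from the placement strategy (not shown).
  pending.filter(r => r.preferredHosts.exists(currentPreferences.contains))
}
{code}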



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8757) Check missing and add user guide for MLlib Python API

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-8757:
---
Comment: was deleted

(was: [~josephkb] Yes, some of those items do have sections and need updates. I 
have specified more details about what is missing.)

 Check missing and add user guide for MLlib Python API
 -

 Key: SPARK-8757
 URL: https://issues.apache.org/jira/browse/SPARK-8757
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Yanbo Liang

 Some MLlib algorithms are missing a user guide for Python; we need to check and 
 add them.
 The algorithms that are missing Python user guides are listed below. Please 
 add to this list if you find more.
 * For MLlib
 ** Isotonic regression (Python example)
 ** LDA (Python example)
 ** Streaming k-means (Java/Python examples)
 ** PCA (Python example)
 ** SVD (Python example)
 ** FP-growth (Python example)
 * For ML
 ** feature
 *** CountVectorizerModel (user guide and examples)
 *** DCT (user guide and examples)
 *** MinMaxScaler (user guide and examples)
 *** StopWordsRemover (user guide and examples)
 *** VectorSlicer (user guide and examples)
 *** ElementwiseProduct (python example)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8757) Check missing and add user guide for MLlib Python API

2015-08-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681536#comment-14681536
 ] 

Yanbo Liang commented on SPARK-8757:


[~josephkb] Yes, some of those items do have sections and need updates. I have 
specified more details about what is missing.

 Check missing and add user guide for MLlib Python API
 -

 Key: SPARK-8757
 URL: https://issues.apache.org/jira/browse/SPARK-8757
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Yanbo Liang

 Some MLlib algorithms are missing a user guide for Python; we need to check and 
 add them.
 The algorithms that are missing Python user guides are listed below. Please 
 add to this list if you find more.
 * For MLlib
 ** Isotonic regression (Python example)
 ** LDA (Python example)
 ** Streaming k-means (Java/Python examples)
 ** PCA (Python example)
 ** SVD (Python example)
 ** FP-growth (Python example)
 * For ML
 ** feature
 *** CountVectorizerModel (user guide and examples)
 *** DCT (user guide and examples)
 *** MinMaxScaler (user guide and examples)
 *** StopWordsRemover (user guide and examples)
 *** VectorSlicer (user guide and examples)
 *** ElementwiseProduct (python example)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8757) Check missing and add user guide for MLlib Python API

2015-08-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681537#comment-14681537
 ] 

Yanbo Liang commented on SPARK-8757:


[~josephkb] Yes, some of those items do have sections and need updates. I have 
specified more details about what is missing.

 Check missing and add user guide for MLlib Python API
 -

 Key: SPARK-8757
 URL: https://issues.apache.org/jira/browse/SPARK-8757
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Yanbo Liang

 Some MLlib algorithms are missing a user guide for Python; we need to check and 
 add them.
 The algorithms that are missing Python user guides are listed below. Please 
 add to this list if you find more.
 * For MLlib
 ** Isotonic regression (Python example)
 ** LDA (Python example)
 ** Streaming k-means (Java/Python examples)
 ** PCA (Python example)
 ** SVD (Python example)
 ** FP-growth (Python example)
 * For ML
 ** feature
 *** CountVectorizerModel (user guide and examples)
 *** DCT (user guide and examples)
 *** MinMaxScaler (user guide and examples)
 *** StopWordsRemover (user guide and examples)
 *** VectorSlicer (user guide and examples)
 *** ElementwiseProduct (python example)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681291#comment-14681291
 ] 

Apache Spark commented on SPARK-9814:
-

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/8096

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681289#comment-14681289
 ] 

Reynold Xin commented on SPARK-9814:


[~hyukjin.kwon] would you like to submit a patch for this?


 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data when reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible 
 to pass for data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is 
 solved (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions coming from 
 the query planner. However, it is experimental, it needs different 
 interfaces, and it is unstable (for reasons such as binary capability).
 In general, the problem below can happen:
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
 2.
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow although it is almost functionally 
 identical, because of the possibly large network traffic (etc.) caused by 
 data that is not filtered at the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:47 AM:
-

[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause its own 
set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns are 
different and (b) a numeric column is mixed into a string column

- The third case still produces an opaque and confusing exception.


was (Author: simeons):
[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause its own 
set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
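As a workaround for the name mismatch in the example above (not part of the original report), selecting and aliasing the columns explicitly keeps the schemas aligned:
{code}
// Hypothetical workaround: avoid SELECT * so the column names line up.
ctx.sql("""select category, num from test_one
           union all
           select cat as category, num from test_another""").show()
{code}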
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note test_another is missing "category" column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 At other times, when the schemas are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
      | union all
      | select * from view_clicks_aug
      | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse 

[jira] [Resolved] (SPARK-9076) Improve NaN value handling

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9076.

  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s:   (was: )

 Improve NaN value handling
 --

 Key: SPARK-9076
 URL: https://issues.apache.org/jira/browse/SPARK-9076
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 This is an umbrella ticket for handling NaN values.
 For general design, please see 
 https://issues.apache.org/jira/browse/SPARK-9079



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3059) Spark internal module interface design

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-3059.
--
Resolution: Later

Closing this one since I'm not sure whether it is useful to have a long-term 
JIRA ticket like this.


 Spark internal module interface design
 --

 Key: SPARK-3059
 URL: https://issues.apache.org/jira/browse/SPARK-3059
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 An umbrella ticket to track various internal module interface designs and 
 implementations for Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2456) Scheduler refactoring

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-2456.
--
Resolution: Later

Closing this one since I'm not sure whether it is useful to have a long-term 
JIRA ticket like this.


 Scheduler refactoring
 -

 Key: SPARK-2456
 URL: https://issues.apache.org/jira/browse/SPARK-2456
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Reynold Xin

 This is an umbrella ticket to track scheduler refactoring. We want to clearly 
 define semantics and responsibilities of each component, and define explicit 
 public interfaces for them so it is easier to understand and to contribute 
 (also less buggy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS

2015-08-11 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681418#comment-14681418
 ] 

Konstantin Shaposhnikov commented on SPARK-8824:


Ok, thank you for the update.

 Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
 ---

 Key: SPARK-8824
 URL: https://issues.apache.org/jira/browse/SPARK-8824
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9770) Add Python API for ml.feature.DCT

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-9770.
--
Resolution: Duplicate

 Add Python API for ml.feature.DCT
 -

 Key: SPARK-9770
 URL: https://issues.apache.org/jira/browse/SPARK-9770
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python API, user guide and example for ml.feature.DCT



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-9770
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for PySpark(ML):
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-9771
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing User Guide documents for PySpark SPARK-8757



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9816) Support BinaryType in Concat

2015-08-11 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-9816:
---

 Summary: Support BinaryType in Concat
 Key: SPARK-9816
 URL: https://issues.apache.org/jira/browse/SPARK-9816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Takeshi Yamamuro


Support BinaryType in catalyst Concat according to hive behaviours.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
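As a rough illustration of the expected semantics (plain Scala for byte arrays, not the catalyst implementation), concatenating binary values should simply concatenate their bytes:
{code}
// Hedged illustration of the intended result for BinaryType inputs.
val a: Array[Byte] = "ab".getBytes("UTF-8")
val b: Array[Byte] = "cd".getBytes("UTF-8")
val concatenated: Array[Byte] = a ++ b  // the bytes of "abcd"
{code}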



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9771) Add Python API for ml.feature.MinMaxScaler

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-9771.
--
Resolution: Duplicate

 Add Python API for ml.feature.MinMaxScaler
 --

 Key: SPARK-9771
 URL: https://issues.apache.org/jira/browse/SPARK-9771
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python API, user guide and example for ml.feature.MinMaxScaler



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala APIs of MLlib/ML, and add
missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala APIs of MLlib/ML, and add
missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala APIs of MLlib/ML, and add
 missing classes/methods/parameters for PySpark.
 * Missing classes for PySpark(ML):
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing User Guide documents for PySpark SPARK-8757



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9148) User-facing documentation for NaN handling semantics

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9148:
---
Parent Issue: SPARK-9565  (was: SPARK-9076)

 User-facing documentation for NaN handling semantics
 

 Key: SPARK-9148
 URL: https://issues.apache.org/jira/browse/SPARK-9148
 Project: Spark
  Issue Type: Technical task
  Components: Documentation, SQL
Reporter: Josh Rosen
Priority: Blocker

 Once we've finalized our NaN changes for Spark 1.5, we need to create 
 user-facing documentation to explain our chosen semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8361) Session of ThriftServer is still alive after I exit beeline

2015-08-11 Thread Weizhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681525#comment-14681525
 ] 

Weizhong commented on SPARK-8361:
-

SparkSQLSessionManager only overrides the closeSession function, which is
called by the client (beeline or others). From the Hive (0.13.1) code we know
that beeline handles Ctrl+D and !quit, which close the session, but it does
not add a shutdown hook, so killing the client may exit the client without
closing the connection.
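
As a rough, hedged sketch of the missing piece (hypothetical JDBC URL and standalone object, not the actual beeline/Hive code), the idea is to register a JVM shutdown hook so an open connection is closed even when the client is killed:
{code}
// Hedged sketch, not the actual beeline/Hive implementation.
import java.sql.{Connection, DriverManager}

object ShutdownHookSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical HiveServer2 endpoint; requires the Hive JDBC driver on the classpath.
    val connection: Connection =
      DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    // Close the connection on JVM shutdown so the ThriftServer can close the
    // corresponding session even if the user hits Ctrl+C.
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = {
        if (!connection.isClosed) connection.close()
      }
    }))
    // ... interactive client work would happen here ...
  }
}
{code}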

 Session of ThriftServer is still alive after I exit beeline
 ---

 Key: SPARK-8361
 URL: https://issues.apache.org/jira/browse/SPARK-8361
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: centos6.2 spark-1.4.0
Reporter: cen yuhai

 I connected to the thriftserver through beeline, but after I exited beeline
 (maybe with 'ctrl + c' or 'ctrl + z'), the session still showed up in the
 ThriftServer Web UI (SQL Tab) with no Finish Time.
 If I exit with 'ctrl + d', it does get a finish time.
 After reviewing the code, I think the session is still alive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9816) Support BinaryType in Concat

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681524#comment-14681524
 ] 

Apache Spark commented on SPARK-9816:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8098

 Support BinaryType in Concat
 

 Key: SPARK-9816
 URL: https://issues.apache.org/jira/browse/SPARK-9816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Takeshi Yamamuro

 Support BinaryType in catalyst Concat according to hive behaviours.
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9816) Support BinaryType in Concat

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9816:
---

Assignee: (was: Apache Spark)

 Support BinaryType in Concat
 

 Key: SPARK-9816
 URL: https://issues.apache.org/jira/browse/SPARK-9816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Takeshi Yamamuro

 Support BinaryType in catalyst Concat according to hive behaviours.
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9816) Support BinaryType in Concat

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9816:
---

Assignee: Apache Spark

 Support BinaryType in Concat
 

 Key: SPARK-9816
 URL: https://issues.apache.org/jira/browse/SPARK-9816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Takeshi Yamamuro
Assignee: Apache Spark

 Support BinaryType in catalyst Concat according to hive behaviours.
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9810) Remove individual commit messages from the squash commit message

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9810:
---
Target Version/s: 1.6.0  (was: 1.5.0)
   Fix Version/s: (was: 1.5.0)
  1.6.0

 Remove individual commit messages from the squash commit message
 

 Key: SPARK-9810
 URL: https://issues.apache.org/jira/browse/SPARK-9810
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.6.0


 I took a look at the commit messages in git log -- it looks like the 
 individual commit messages are not that useful to include, but do make the 
 commit messages more verbose. They are usually just a bunch of extremely 
 concise descriptions of bug fixes, merges, etc:
 {code}
 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client
 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
 {code}
 See mailing list discussions: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-Removing-individual-commit-messages-from-the-squash-commit-message-td13295.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9636) Treat $SPARK_HOME as write-only

2015-08-11 Thread Philipp Angerer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681403#comment-14681403
 ] 

Philipp Angerer commented on SPARK-9636:


OK, great :)

I see why you think my proposal might be too complex, yet I still think that
“log file location relative to the binary” is much more surprising in an
environment where log files have dedicated places.

{{/var/log/}} is where I really expect a system daemon to put its logs.
{{~/.cache/logs}} is merely the best compromise in the absence of a dedicated
user log directory (e.g. {{$XDG_USER_DATA_DIR}} and {{$XDG_USER_CONFIG_DIR}}
are clear, but there’s no {{$XDG_USER_STATE_DIR}}).

I think all this is a consequence of Spark not being a good Linux citizen. It
has a {{$SPARK_HOME}} and relies on it, while there should be a way to run it
split across sensible directories: {{/usr/share/spark/}} for data,
{{/usr/lib/spark/}} for shared libraries, {{/usr/lib/pythonx.x/site-packages/}}
for pyspark, {{/usr/bin/}} for binaries and scripts, {{/etc/spark/}} for
configs, and {{/var/log/spark}} for logfiles.

 Treat $SPARK_HOME as write-only
 ---

 Key: SPARK-9636
 URL: https://issues.apache.org/jira/browse/SPARK-9636
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.4.1
 Environment: Linux
Reporter: Philipp Angerer
Priority: Minor
  Labels: easyfix

 When starting Spark scripts as a user while Spark is installed in a directory
 the user has no write permission on, many things work fine, except for the
 logs (e.g. for {{start-master.sh}}).
 Logs are written by default to {{$SPARK_LOG_DIR}} or (if unset) to
 {{$SPARK_HOME/logs}}.
 If installed in this way, Spark should write logs to {{/var/log/spark/}}
 instead of throwing an error. That is easy to fix by simply testing a few log
 dirs in sequence for writability before using one; I suggest trying
 {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}}
 → {{$SPARK_HOME/logs/}}, as sketched below.
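 A minimal standalone sketch of that fallback order (an assumption about how it could look, with a hypothetical helper name, not existing Spark code):
 {code}
 // Hedged sketch: pick the first writable candidate among SPARK_LOG_DIR,
 // /var/log/spark, ~/.cache/spark-logs and SPARK_HOME/logs.
 import java.io.File

 def firstWritableLogDir(): Option[File] = {
   val candidates = Seq(
     sys.env.get("SPARK_LOG_DIR"),
     Some("/var/log/spark"),
     Some(sys.props("user.home") + "/.cache/spark-logs"),
     sys.env.get("SPARK_HOME").map(_ + "/logs")
   ).flatten.map(new File(_))
   // A directory qualifies if it exists (or can be created) and is writable.
   candidates.find(dir => (dir.isDirectory || dir.mkdirs()) && dir.canWrite)
 }
 {code}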



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681473#comment-14681473
 ] 

Sean Owen commented on SPARK-9776:
--

Yeah I see the same. I don't know enough about HiveContext to know if this 
indicates something else is going on, but the error message could at least be 
better. How is your hive-site.xml configured?

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an
 error, though the same works with spark-1.4.1:
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the
 database



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9816) Support BinaryType in Concat

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681531#comment-14681531
 ] 

Apache Spark commented on SPARK-9816:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8099

 Support BinaryType in Concat
 

 Key: SPARK-9816
 URL: https://issues.apache.org/jira/browse/SPARK-9816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Takeshi Yamamuro

 Support BinaryType in catalyst Concat according to hive behaviours.
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8757) Check missing and add user guide for MLlib Python API

2015-08-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-8757:
---
Description: 
Some MLlib algorithms are missing a Python user guide; we need to check and
add them.
The algorithms missing Python user guides are listed below. Please add to this
list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide and examples)
*** DCT (user guide and examples)
*** MinMaxScaler (user guide and examples)
*** StopWordsRemover (user guide and examples)
*** VectorSlicer (user guide and examples)
*** ElementwiseProduct (python example)

  was:
Some MLlib algorithms are missing a Python user guide; we need to check and
add them.
The algorithms missing Python user guides are listed below. Please add to this
list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide)
*** DCT (user guide)
*** MinMaxScaler (user guide)
*** StopWordsRemover (user guide)
*** VectorSlicer (user guide)
*** ElementwiseProduct (python example)


 Check missing and add user guide for MLlib Python API
 -

 Key: SPARK-8757
 URL: https://issues.apache.org/jira/browse/SPARK-8757
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Yanbo Liang

 Some MLlib algorithms are missing a Python user guide; we need to check and
 add them.
 The algorithms missing Python user guides are listed below. Please add to this
 list if you find more.
 * For MLlib
 ** Isotonic regression (Python example)
 ** LDA (Python example)
 ** Streaming k-means (Java/Python examples)
 ** PCA (Python example)
 ** SVD (Python example)
 ** FP-growth (Python example)
 * For ML
 ** feature
 *** CountVectorizerModel (user guide and examples)
 *** DCT (user guide and examples)
 *** MinMaxScaler (user guide and examples)
 *** StopWordsRemover (user guide and examples)
 *** VectorSlicer (user guide and examples)
 *** ElementwiseProduct (python example)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9817:
---

Assignee: (was: Apache Spark)

 Improve the container placement strategy by considering the localities of 
 pending container requests
 

 Key: SPARK-9817
 URL: https://issues.apache.org/jira/browse/SPARK-9817
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Saisai Shao
Priority: Minor

 The current implementation does not consider the localities of pending
 container requests, since the required locality preferences of tasks shift
 from time to time. It is better to discard outdated container requests and
 recalculate them with the container placement strategy.
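 A rough, hedged sketch of the idea with hypothetical types (this is not the YarnAllocator API): split pending requests into those still targeting a currently preferred host and those that are outdated and should be cancelled and recomputed.
 {code}
 // Hypothetical types for illustration only; not Spark's YARN allocator code.
 case class ContainerRequest(preferredHosts: Set[String])

 def splitOutdated(
     pending: Seq[ContainerRequest],
     currentPreferredHosts: Set[String]): (Seq[ContainerRequest], Seq[ContainerRequest]) = {
   // Keep requests that still target at least one currently preferred host;
   // everything else is outdated and should be cancelled and recalculated.
   pending.partition(_.preferredHosts.exists(currentPreferredHosts.contains))
 }
 {code}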



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests

2015-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9817:
---

Assignee: Apache Spark

 Improve the container placement strategy by considering the localities of 
 pending container requests
 

 Key: SPARK-9817
 URL: https://issues.apache.org/jira/browse/SPARK-9817
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Saisai Shao
Assignee: Apache Spark
Priority: Minor

 The current implementation does not consider the localities of pending
 container requests, since the required locality preferences of tasks shift
 from time to time. It is better to discard outdated container requests and
 recalculate them with the container placement strategy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681556#comment-14681556
 ] 

Apache Spark commented on SPARK-9817:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8100

 Improve the container placement strategy by considering the localities of 
 pending container requests
 

 Key: SPARK-9817
 URL: https://issues.apache.org/jira/browse/SPARK-9817
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Saisai Shao
Priority: Minor

 The current implementation does not consider the localities of pending
 container requests, since the required locality preferences of tasks shift
 from time to time. It is better to discard outdated container requests and
 recalculate them with the container placement strategy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9809) Task crashes because the internal accumulators are not properly initialized

2015-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9809:
---
Description: 
When a stage failed and another stage was resubmitted with only part of the
partitions to compute, all the tasks failed with the error message:
java.util.NoSuchElementException: key not found: peakExecutionMemory.
This is because the internal accumulators are not properly initialized for this
stage, while other code assumes the internal accumulators always exist.

{code}
Job aborted due to stage failure: Task 4 in stage 12.0 failed 4 times, most 
recent failure: Lost task 4.3 in stage 12.0 (TID 4460, 1
0.1.2.40): java.util.NoSuchElementException: key not found: peakExecutionMemory
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{code}
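
For reference, a minimal standalone illustration of the failure mode (plain Scala, unrelated to the actual fix in Spark): Map.apply throws the "key not found" error for a missing key, whereas a getOrElse-style lookup does not.
{code}
// Minimal illustration, not Spark code.
val accums = Map("someOtherAccumulator" -> 0L)
// accums("peakExecutionMemory")                        // throws NoSuchElementException: key not found
val peak = accums.getOrElse("peakExecutionMemory", 0L)  // returns the default, 0L
{code}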

  was:
When a stage failed and another stage was resubmitted with only part of the
partitions to compute, all the tasks failed with the error message:
java.util.NoSuchElementException: key not found: peakExecutionMemory.
This is because the internal accumulators are not properly initialized for this
stage, while other code assumes the internal accumulators always exist.

Job aborted due to stage failure: Task 4 in stage 12.0 failed 4 times, most 
recent failure: Lost task 4.3 in stage 12.0 (TID 4460, 1
0.1.2.40): java.util.NoSuchElementException: key not found: peakExecutionMemory
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


 Task crashes because the internal accumulators are not properly initialized
 ---

 Key: SPARK-9809
 URL: https://issues.apache.org/jira/browse/SPARK-9809
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Carson Wang
Priority: Blocker

 When a stage failed and another stage was resubmitted with only part of the
 partitions to compute, all the tasks failed with the error message:
 java.util.NoSuchElementException: key not found: peakExecutionMemory.
 This is because the internal accumulators are not properly initialized for
 this stage, while other code assumes the internal accumulators always exist.
 {code}
 Job aborted due to stage failure: Task 4 in stage 12.0 failed 4 times, most 
 recent failure: Lost task 4.3 in stage 12.0 (TID 4460, 1
 0.1.2.40): java.util.NoSuchElementException: key not found: 
 peakExecutionMemory
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at scala.collection.AbstractMap.apply(Map.scala:58)
 at 
 org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
 

[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov commented on SPARK-9813:


[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause it's 
own set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.
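
A hedged sketch of the compatibility rule being discussed (illustrative only, not the analyzer implementation): a UNION ALL would be accepted only when both sides have the same number of columns and pairwise-compatible types.
{code}
// Illustrative helper, not Spark's analysis rule.
import org.apache.spark.sql.DataFrame

def unionCompatible(left: DataFrame, right: DataFrame): Boolean = {
  val l = left.schema.fields
  val r = right.schema.fields
  // Same column count and, for simplicity, identical data types per position.
  l.length == r.length && l.zip(r).forall { case (a, b) => a.dataType == b.dataType }
}
{code}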

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note test_another is missing category column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 At other times, when the schemas are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
      | union all
      | select * from view_clicks_aug
      | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse Completed
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks_aug
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks_aug
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks_aug
 15/08/11 

[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:46 AM:
-

[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause its own 
set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.


was (Author: simeons):
[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause it's 
own set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note test_another is missing category column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 At other times, when the schemas are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
      | union all
      | select * from view_clicks_aug
      | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse 

[jira] [Commented] (SPARK-8724) Need documentation on how to deploy or use SparkR in Spark 1.4.0+

2015-08-11 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681427#comment-14681427
 ] 

Vincent Warmerdam commented on SPARK-8724:
--

[~shivaram] [~felixcheung] Does this still need to be open? Or do we want to
add parts of the RStudio blog post to the documentation on the SparkR end?

 Need documentation on how to deploy or use SparkR in Spark 1.4.0+
 -

 Key: SPARK-8724
 URL: https://issues.apache.org/jira/browse/SPARK-8724
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
Reporter: Felix Cheung
Priority: Minor

 As of now there doesn't seem to be any official documentation on how to 
 deploy SparkR with Spark 1.4.0+
 Also, cluster manager specific documentation (like 
 http://spark.apache.org/docs/latest/spark-standalone.html) does not call out 
 what mode is supported for SparkR and details on deployment steps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


