[jira] [Commented] (SPARK-7639) Add Python API for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557681#comment-14557681 ]

Apache Spark commented on SPARK-7639:

User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6387

Add Python API for Statistics.kernelDensity
Key: SPARK-7639
URL: https://issues.apache.org/jira/browse/SPARK-7639
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

Add Python API for org.apache.spark.mllib.stat.Statistics.kernelDensity
[jira] [Assigned] (SPARK-7639) Add Python API for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7639:
Assignee: (was: Apache Spark)

Add Python API for Statistics.kernelDensity
Key: SPARK-7639
URL: https://issues.apache.org/jira/browse/SPARK-7639
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

Add Python API for org.apache.spark.mllib.stat.Statistics.kernelDensity
[jira] [Assigned] (SPARK-7639) Add Python API for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7639:
Assignee: Apache Spark

Add Python API for Statistics.kernelDensity
Key: SPARK-7639
URL: https://issues.apache.org/jira/browse/SPARK-7639
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Apache Spark

Add Python API for org.apache.spark.mllib.stat.Statistics.kernelDensity
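For context, a minimal sketch of the Scala-side kernel density API that the requested Python wrapper would mirror (this assumes the {{org.apache.spark.mllib.stat.KernelDensity}} builder shipping in Spark 1.4 and an existing SparkContext {{sc}}, as in spark-shell; sample and evaluation points are illustrative):

{code}
import org.apache.spark.mllib.stat.KernelDensity

// `sc` is an existing SparkContext (e.g. provided by spark-shell).
val sample = sc.parallelize(Seq(1.0, 2.0, 4.0, 5.0))

// Estimate the density of the sample at three evaluation points,
// using a Gaussian kernel with bandwidth 3.0.
val densities: Array[Double] = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)
  .estimate(Array(-1.0, 2.0, 5.0))
{code}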
[jira] [Updated] (SPARK-7524) add configs for keytab and principal, move originals to internal
[ https://issues.apache.org/jira/browse/SPARK-7524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Wang updated SPARK-7524:
Description: Spark now supports long-running services by renewing tokens for the NameNode, but it only accepts the keytab and principal as --k=v command-line options, which is not very convenient. I want to add spark.* configs that can be set in the properties file and as system properties.
was: Spark now supports long-running services by renewing tokens for the NameNode, but it only accepts the keytab and principal as --k=v command-line options, which is not very convenient. I want to add spark.* configs that can be set in the properties file and as system properties, and move the originals to spark.internal.*.

add configs for keytab and principal, move originals to internal
Key: SPARK-7524
URL: https://issues.apache.org/jira/browse/SPARK-7524
Project: Spark
Issue Type: Improvement
Components: YARN
Reporter: Tao Wang

Spark now supports long-running services by renewing tokens for the NameNode, but it only accepts the keytab and principal as --k=v command-line options, which is not very convenient. I want to add spark.* configs that can be set in the properties file and as system properties.
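For illustration, a minimal sketch of the proposed spark.* form (the spark.yarn.keytab and spark.yarn.principal key names follow the related SPARK-7846 below; the path and principal values are placeholders):

{code}
import org.apache.spark.SparkConf

// Carry the credentials as ordinary spark.* properties rather than dedicated
// --k=v options; key names follow SPARK-7846, values are placeholders.
val conf = new SparkConf()
  .set("spark.yarn.keytab", "/path/to/user.keytab")
  .set("spark.yarn.principal", "user@EXAMPLE.COM")
{code}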
[jira] [Assigned] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
[ https://issues.apache.org/jira/browse/SPARK-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7846:
Assignee: Apache Spark

Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang
Assignee: Apache Spark

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
[jira] [Commented] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
[ https://issues.apache.org/jira/browse/SPARK-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557732#comment-14557732 ]

Apache Spark commented on SPARK-7846:

User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/6051

Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
[jira] [Assigned] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
[ https://issues.apache.org/jira/browse/SPARK-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7846:
Assignee: (was: Apache Spark)

Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
[jira] [Assigned] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
[ https://issues.apache.org/jira/browse/SPARK-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7809:
Assignee: Apache Spark

MultivariateOnlineSummarizer should allow users to configure what to compute
Key: SPARK-7809
URL: https://issues.apache.org/jira/browse/SPARK-7809
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

Now MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute.
{code}
val summarizer = new MultivariateOnlineSummarizer()
  .withMean(false)
  .withMax(false)
{code}
[jira] [Commented] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
[ https://issues.apache.org/jira/browse/SPARK-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557695#comment-14557695 ]

Apache Spark commented on SPARK-7809:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6388

MultivariateOnlineSummarizer should allow users to configure what to compute
Key: SPARK-7809
URL: https://issues.apache.org/jira/browse/SPARK-7809
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng

Now MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute.
{code}
val summarizer = new MultivariateOnlineSummarizer()
  .withMean(false)
  .withMax(false)
{code}
[jira] [Assigned] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
[ https://issues.apache.org/jira/browse/SPARK-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7809:
Assignee: (was: Apache Spark)

MultivariateOnlineSummarizer should allow users to configure what to compute
Key: SPARK-7809
URL: https://issues.apache.org/jira/browse/SPARK-7809
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng

Now MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute.
{code}
val summarizer = new MultivariateOnlineSummarizer()
  .withMean(false)
  .withMax(false)
{code}
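As a point of reference, a short sketch of the summarizer as it behaves today, where every statistic is computed unconditionally (the withMean/withMax setters quoted above are only proposed and do not exist yet):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// The existing API: add samples one at a time, then read any statistic.
val summarizer = new MultivariateOnlineSummarizer()
summarizer.add(Vectors.dense(1.0, 10.0))
summarizer.add(Vectors.dense(3.0, 20.0))
println(summarizer.mean)     // per-column means: [2.0, 15.0]
println(summarizer.variance) // per-column sample variances
{code}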
[jira] [Created] (SPARK-7847) Fix dynamic partition path escaping
Cheng Lian created SPARK-7847:
Summary: Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
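The double-escaping is easy to reproduce outside Spark. A standalone sketch of the mechanism described above, using plain java.net.URI rather than Spark's code:

{code}
import java.net.URI

// Directory name written for partition value "A/B", escaped once:
val escapedOnce = "p=A%2fB"

// Java's multi-argument URI constructors always quote '%', so re-encoding an
// already-escaped path (as Path.toUri.toString effectively does) escapes it
// a second time:
val doubleEscaped = new URI(null, null, s"/table/$escapedOnce", null).toString
println(doubleEscaped) // prints /table/p=A%252fB
{code}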
[jira] [Created] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Tao Wang created SPARK-7846:
Summary: Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
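A minimal sketch of the proposed alternative: hand the values to the JVM as system properties, so the third-party option parsers in the main classes never see unrecognised --principal/--keytab flags (the values are placeholders):

{code}
// Set before the main class parses its arguments; values are placeholders.
sys.props("spark.yarn.principal") = "user@EXAMPLE.COM"
sys.props("spark.yarn.keytab") = "/path/to/user.keytab"
{code}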
[jira] [Assigned] (SPARK-7847) Fix dynamic partition path escaping
[ https://issues.apache.org/jira/browse/SPARK-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7847:
Assignee: Apache Spark (was: Cheng Lian)

Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
[jira] [Assigned] (SPARK-7847) Fix dynamic partition path escaping
[ https://issues.apache.org/jira/browse/SPARK-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7847:
Assignee: Cheng Lian (was: Apache Spark)

Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
[jira] [Commented] (SPARK-7847) Fix dynamic partition path escaping
[ https://issues.apache.org/jira/browse/SPARK-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557728#comment-14557728 ]

Apache Spark commented on SPARK-7847:

User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6389

Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
[jira] [Commented] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557804#comment-14557804 ]

Apache Spark commented on SPARK-7535:

User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6392

Audit Pipeline APIs for 1.4
Key: SPARK-7535
URL: https://issues.apache.org/jira/browse/SPARK-7535
Project: Spark
Issue Type: Sub-task
Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng

This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to check:
* Public/protected/private access
* Consistency across spark.ml
* Classes, methods, and parameters in spark.mllib but missing in spark.ml
** We should create JIRAs for each of these (under an umbrella) as to-do items for future releases.

For each algorithm or API component, create a subtask under this umbrella. Some major new items:
* new feature transformers
* tree models
* elastic-net
* ML attributes
* developer APIs (Predictor, Classifier, Regressor)
[jira] [Resolved] (SPARK-7805) Move SQLTestUtils.scala from src/main
[ https://issues.apache.org/jira/browse/SPARK-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7805.
Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6334 [https://github.com/apache/spark/pull/6334]

Move SQLTestUtils.scala from src/main
Key: SPARK-7805
URL: https://issues.apache.org/jira/browse/SPARK-7805
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Patrick Wendell
Assignee: Yin Huai
Priority: Critical
Fix For: 1.4.0

These classes trigger binary compatibility issues when changed. In general we shouldn't be putting test code in src/main. If it's needed by multiple modules, IIRC we have a way to do that (look elsewhere in Spark).
[jira] [Updated] (SPARK-7848) Update SparkStreaming docs to include FAQ for knobs
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated SPARK-7848:
Summary: Update SparkStreaming docs to include FAQ for knobs (was: Update SparkStreaming docs to include knobs)

Update SparkStreaming docs to include FAQ for knobs
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Updated] (SPARK-7848) Update SparkStreaming docs to include FAQ or bullets for knobs.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated SPARK-7848:
Summary: Update SparkStreaming docs to include FAQ or bullets for knobs. (was: Update SparkStreaming docs to include FAQ for knobs)

Update SparkStreaming docs to include FAQ or bullets for knobs.
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Updated] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ knobs information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated SPARK-7848:
Summary: Update SparkStreaming docs to incorporate FAQ and/or bullets w/ knobs information. (was: Update SparkStreaming docs to include FAQ or bullets for knobs.)

Update SparkStreaming docs to incorporate FAQ and/or bullets w/ knobs information.
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557820#comment-14557820 ]

Peng Cheng commented on SPARK-7442:

Adding the jar won't solve the problem; you also need to set the following parameters:
--conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
But in my 2.6 environment the added jar is ignored by the worker's classloader for an unknown reason; see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
Key: SPARK-7442
URL: https://issues.apache.org/jira/browse/SPARK-7442
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.3.1
Environment: OS X
Reporter: Nicholas Chammas

# Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
# Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}.
# Fire up PySpark and try reading from S3 with something like this: {code}sc.textFile('s3n://bucket/file_*').count(){code}
# You will get an error like this: {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.io.IOException: No FileSystem for scheme: s3n{code}

{{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works. It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
[jira] [Comment Edited] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557820#comment-14557820 ]

Peng Cheng edited comment on SPARK-7442 at 5/24/15 6:55 PM:

Adding the jar won't solve the problem; you also need to set the following parameters:
--conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
But in my 2.6 environment the fs implementation in the added jar is ignored by the worker's classloader for an unknown reason; see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

was (Author: peng):
Adding the jar won't solve the problem; you also need to set the following parameters:
--conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
But in my 2.6 environment the added jar is ignored by the worker's classloader for an unknown reason; see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
Key: SPARK-7442
URL: https://issues.apache.org/jira/browse/SPARK-7442
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.3.1
Environment: OS X
Reporter: Nicholas Chammas

# Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
# Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}.
# Fire up PySpark and try reading from S3 with something like this: {code}sc.textFile('s3n://bucket/file_*').count(){code}
# You will get an error like this: {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.io.IOException: No FileSystem for scheme: s3n{code}

{{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works. It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
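The same two settings in programmatic form, as a sketch (it assumes spark-submit supplies the master and that the jars providing these filesystem classes are actually on the classpath, which is exactly the part reported as broken above):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of the --conf flags quoted in the comment above.
val conf = new SparkConf()
  .set("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
val sc = new SparkContext(conf)
sc.textFile("s3n://bucket/file_*").count() // the read that previously failed
{code}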
[jira] [Created] (SPARK-7848) Update SparkStreaming docs to include knobs
jay vyas created SPARK-7848:
Summary: Update SparkStreaming docs to include knobs
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Commented] (SPARK-7848) Update SparkStreaming docs to include knobs
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557756#comment-14557756 ]

jay vyas commented on SPARK-7848:

COPIED from the ASF mailing list thread for convenience.
{noformat}
Blocks are replicated immediately, before the driver launches any jobs using them.

On Thu, May 21, 2015 at 2:05 AM, Hemant Bhanawat hemant9...@gmail.com wrote:
Honestly, given the length of my email, I didn't expect a reply. :-) Thanks for reading and replying. However, I have a follow-up question: I don't think I understand the block replication completely. Are the blocks replicated immediately after they are received by the receiver? Or are they kept on the receiver node only and moved only on shuffle? Does the replication have something to do with locality.wait?
Thanks,
Hemant

On Thu, May 21, 2015 at 2:21 AM, Tathagata Das t...@databricks.com wrote:
Correcting the ones that are incorrect or incomplete. BUT this is a good list of things to remember about Spark Streaming.

On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote:
Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on Spark Streaming. Is my understanding correct? Any other important design considerations I should take care of?

A DStream is associated with a single receiver. For attaining read parallelism, multiple receivers, i.e. multiple DStreams, need to be created.

A receiver is run within an executor and occupies one core. Ensure that there are enough cores for processing after receiver slots are booked, i.e. spark.cores.max should take the receiver slots into account. The receivers are allocated to executors in a round-robin fashion.

When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval, where N = batchInterval/blockInterval. These blocks are distributed by the BlockManager of the current executor to the block managers of other executors. After that, the Network Input Tracker running on the driver is informed about the block locations for further processing.

An RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD, and each partition is a task in Spark. blockInterval == batchInterval would mean that a single partition is created and it is probably processed locally.

The map tasks on the blocks are processed in the executors that have the blocks (the one that received the block, and the one where the block was replicated) irrespective of the block interval, unless non-local scheduling kicks in (as you observed next).

Having a bigger blockInterval means bigger blocks. A high value of spark.locality.wait increases the chance of processing a block on the local node. A balance needs to be found between these two parameters to ensure that the bigger blocks are processed locally.

Instead of relying on batchInterval and blockInterval, you can define the number of partitions by calling dstream.repartition(n). This reshuffles the data in the RDD randomly to create n partitions. Yes, for greater parallelism, though it comes at the cost of a shuffle.

An RDD's processing is scheduled by the driver's JobScheduler as a job. At a given point of time only one job is active, so if one job is executing, the other jobs are queued.
If you have two DStreams, there will be two RDDs formed and two jobs created, which will be scheduled one after another. To avoid this, you can union the two DStreams. This ensures that a single unionRDD is formed for the two RDDs of the DStreams, and this unionRDD is then considered as a single job. However, the partitioning of the RDDs is not impacted. To further clarify, the number of jobs depends on the number of output operations (print, foreachRDD, saveAsXFiles) and the number of RDD actions in those output operations.

dstream1.union(dstream2).foreachRDD { rdd => rdd.count() } // one Spark job per batch
dstream1.union(dstream2).foreachRDD { rdd => { rdd.count(); rdd.count() } } // TWO Spark jobs per batch
dstream1.foreachRDD { rdd => rdd.count() }; dstream2.foreachRDD { rdd => rdd.count() } // TWO Spark jobs per batch

If the batch processing time is more than the batchInterval then obviously the receiver's memory will start filling up and will end up throwing exceptions (most probably BlockNotFoundException). Currently there is no way to pause the receiver. You can limit the rate of a receiver using the SparkConf config spark.streaming.receiver.maxRate.

For being fully fault-tolerant, Spark Streaming needs to enable checkpointing. Checkpointing increases the batch processing time. Incomplete. There are two types of checkpointing - data and metadata checkpointing.
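A short sketch tying two of the knobs from the thread above to code (the values are illustrative only; local[2] reserves one core for the receiver and one for processing, per the receiver-core note above):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]") // one core for the receiver, one for processing
  .setAppName("streaming-knobs-sketch")
  .set("spark.streaming.receiver.maxRate", "10000") // records/sec per receiver
val ssc = new StreamingContext(conf, Seconds(2))    // batchInterval = 2 seconds
{code}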
[jira] [Resolved] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7845.
Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6384 [https://github.com/apache/spark/pull/6384]

Bump Hadoop 1 tests to version 1.2.1
Key: SPARK-7845
URL: https://issues.apache.org/jira/browse/SPARK-7845
Project: Spark
Issue Type: Improvement
Components: Tests
Reporter: Patrick Wendell
Assignee: Yin Huai
Priority: Critical
Fix For: 1.4.0

A small number of APIs were added to Hadoop between 1.0.4 and 1.2.1. This appears to be one cause of SPARK-7843, since some Hive code relies on the newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist.
https://github.com/apache/spark/blob/master/dev/run-tests#L43
[jira] [Commented] (SPARK-7805) Move SQLTestUtils.scala from src/main
[ https://issues.apache.org/jira/browse/SPARK-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557793#comment-14557793 ]

Apache Spark commented on SPARK-7805:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6391

Move SQLTestUtils.scala from src/main
Key: SPARK-7805
URL: https://issues.apache.org/jira/browse/SPARK-7805
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Patrick Wendell
Assignee: Yin Huai
Priority: Critical
Fix For: 1.4.0

These classes trigger binary compatibility issues when changed. In general we shouldn't be putting test code in src/main. If it's needed by multiple modules, IIRC we have a way to do that (look elsewhere in Spark).
[jira] [Created] (SPARK-7849) Update Spark SQL Hive support documentation for 1.4
Cheng Lian created SPARK-7849:
Summary: Update Spark SQL Hive support documentation for 1.4
Key: SPARK-7849
URL: https://issues.apache.org/jira/browse/SPARK-7849
Project: Spark
Issue Type: Documentation
Components: Documentation, SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Priority: Critical

Hive support contents need to be updated for 1.4. Most importantly, after introducing the isolated classloader mechanism in 1.4, the following questions need to be clarified:
# How to enable Hive support
# What versions of Hive are supported
# How to specify the metastore version
[jira] [Updated] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-7843:
Priority: Major (was: Critical)

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Updated] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-7843:
Summary: Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4 (was: Several thrift server failures in Spark 1.4 sbt build with hadoop 1)

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Priority: Critical
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Updated] (SPARK-7811) Fix typo on slf4j configuration on metrics.properties.template
[ https://issues.apache.org/jira/browse/SPARK-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-7811:
Assignee: Judy Nash

Fix typo on slf4j configuration on metrics.properties.template
Key: SPARK-7811
URL: https://issues.apache.org/jira/browse/SPARK-7811
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Judy Nash
Assignee: Judy Nash
Priority: Trivial
Fix For: 1.5.0

There is a minor typo in the Slf4jSink configuration in metrics.properties.template: "slf4j" is misspelled as "sl4j" in 2 of the configuration entries. Correcting the typo so users' custom settings will be loaded correctly.
[jira] [Commented] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557844#comment-14557844 ]

Sean Owen commented on SPARK-7843:

Does https://issues.apache.org/jira/browse/SPARK-7845 resolve this then?

Several thrift server failures in Spark 1.4 sbt build with hadoop 1
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Priority: Critical
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Commented] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557848#comment-14557848 ]

Yin Huai commented on SPARK-7843:

Yeah, I think we can resolve this one. After the investigation, I feel it will be really hard to make the thrift server work with hadoop 1.0.4. [~chenghao] If you have any ideas on a workaround, feel free to comment here.

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Resolved] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7843.
Resolution: Not A Problem

Since we just bumped the hadoop version in the Hadoop 1 build and these tests work again, I am resolving this one as Not A Problem.

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Resolved] (SPARK-7811) Fix typo on slf4j configuration on metrics.properties.template
[ https://issues.apache.org/jira/browse/SPARK-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-7811.
Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6362 [https://github.com/apache/spark/pull/6362]

Fix typo on slf4j configuration on metrics.properties.template
Key: SPARK-7811
URL: https://issues.apache.org/jira/browse/SPARK-7811
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Judy Nash
Priority: Trivial
Fix For: 1.5.0

There is a minor typo in the Slf4jSink configuration in metrics.properties.template: "slf4j" is misspelled as "sl4j" in 2 of the configuration entries. Correcting the typo so users' custom settings will be loaded correctly.
[jira] [Created] (SPARK-7851) SparkSQL CLI built against Hive 0.13 throws exception when used with Hive 0.12 HCat
Cheolsoo Park created SPARK-7851:
Summary: SparkSQL CLI built against Hive 0.13 throws exception when used with Hive 0.12 HCat
Key: SPARK-7851
URL: https://issues.apache.org/jira/browse/SPARK-7851
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheolsoo Park
Priority: Minor

I built Spark with {{Hive 0.13}} and set the following properties:
{code}
spark.sql.hive.metastore.version=0.12.0
spark.sql.hive.metastore.jars=path_to_hive_0.12_jars
{code}
But when the SparkSQL CLI starts up, I get the following error:
{code}
15/05/24 05:03:29 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
	at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
	at com.sun.proxy.$Proxy12.getFunctions(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
	at org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:175)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
{code}
What's happening is that when the SparkSQL CLI starts up, it tries to fetch permanent UDFs from the Hive metastore (due to HIVE-6330, which was introduced in Hive 0.13). But it then ends up invoking an incompatible thrift function that doesn't exist in Hive 0.12. To work around this error, I have to comment out the following line of code: https://goo.gl/wcfnH1
[jira] [Comment Edited] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911 ]

Sandy Ryza edited comment on SPARK-7699 at 5/25/15 1:26 AM:

[~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable way. I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts.

was (Author: sandyr):
[~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable. I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts.

Config spark.dynamicAllocation.initialExecutors has no effect
Key: SPARK-7699
URL: https://issues.apache.org/jira/browse/SPARK-7699
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula

spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 3
spark.dynamicAllocation.maxExecutors 4

Just run the spark-shell with the above configuration; the initial executor number is 2.
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911 ]

Sandy Ryza commented on SPARK-7699:

[~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable way. I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts.

Config spark.dynamicAllocation.initialExecutors has no effect
Key: SPARK-7699
URL: https://issues.apache.org/jira/browse/SPARK-7699
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula

spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 3
spark.dynamicAllocation.maxExecutors 4

Just run the spark-shell with the above configuration; the initial executor number is 2.
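The reporter's configuration, expressed programmatically as a sketch (spark.dynamicAllocation.enabled is added here since dynamic allocation must be switched on for the other three settings to matter):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "3")
  .set("spark.dynamicAllocation.maxExecutors", "4")
// Reported behavior: the shell starts with 2 executors, i.e. the
// initialExecutors setting is effectively ignored.
{code}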
[jira] [Commented] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
[ https://issues.apache.org/jira/browse/SPARK-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557950#comment-14557950 ]

Apache Spark commented on SPARK-7850:

User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/6393

Hive 0.12.0 profile in POM should be removed
Key: SPARK-7850
URL: https://issues.apache.org/jira/browse/SPARK-7850
Project: Spark
Issue Type: Bug
Components: Build, Documentation
Affects Versions: 1.4.0
Reporter: Cheolsoo Park
Priority: Minor

Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed.
[jira] [Assigned] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
[ https://issues.apache.org/jira/browse/SPARK-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7850: --- Assignee: (was: Apache Spark) Hive 0.12.0 profile in POM should be removed Key: SPARK-7850 URL: https://issues.apache.org/jira/browse/SPARK-7850 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.4.0 Reporter: Cheolsoo Park Priority: Minor Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6784) Make sure values of partitioning columns are correctly converted based on their data types
[ https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557936#comment-14557936 ] Adrian Wang commented on SPARK-6784: Oh, I didn't notice this ticket had changed... I filed another JIRA at SPARK-7790. https://github.com/apache/spark/pull/6318 Make sure values of partitioning columns are correctly converted based on their data types -- Key: SPARK-6784 URL: https://issues.apache.org/jira/browse/SPARK-6784 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Adrian Wang Priority: Blocker We used to have problems where values of partitioning columns were not correctly cast to the desired Spark SQL values based on their data types. Let's make sure we correctly do that for both Hive's partitions and HadoopFSRelation's partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
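As background for what "correctly converted" means here, a simplified sketch (illustrative only; {{castPartitionValue}} is a hypothetical helper, not the actual Spark internal): partition values are parsed out of directory names as strings, so each one needs a cast to its column's declared type.
{code:scala}
import org.apache.spark.sql.types._

// Partition values arrive from paths such as ".../year=2015/month=5" as raw
// strings; each must be converted to the declared type of its column.
def castPartitionValue(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType => raw.toInt
  case LongType    => raw.toLong
  case DoubleType  => raw.toDouble
  case BooleanType => raw.toBoolean
  case StringType  => raw
  case other       => sys.error(s"Unsupported partition column type: $other")
}

// castPartitionValue("2015", IntegerType) yields the Int 2015,
// not the String "2015".
{code}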
[jira] [Created] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
Cheolsoo Park created SPARK-7850: Summary: Hive 0.12.0 profile in POM should be removed Key: SPARK-7850 URL: https://issues.apache.org/jira/browse/SPARK-7850 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.4.0 Reporter: Cheolsoo Park Priority: Minor Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
[ https://issues.apache.org/jira/browse/SPARK-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7850: --- Assignee: Apache Spark Hive 0.12.0 profile in POM should be removed Key: SPARK-7850 URL: https://issues.apache.org/jira/browse/SPARK-7850 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.4.0 Reporter: Cheolsoo Park Assignee: Apache Spark Priority: Minor Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7844) Broken tests in KernelDensity
Manoj Kumar created SPARK-7844: -- Summary: Broken tests in KernelDensity Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
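For reference, a self-contained sketch of the normalization being argued for (illustrative code, not the MLlib implementation): a Gaussian KDE over n samples divides by n, the number of samples, regardless of how many partitions the data has or how many points are evaluated.
{code:scala}
// Gaussian kernel density estimate at point x:
//   f(x) = (1 / (n * h * sqrt(2 * pi))) * sum_i exp(-(x - x_i)^2 / (2 * h^2))
// where n is the number of samples and h the bandwidth.
def kde(samples: Seq[Double], h: Double)(x: Double): Double = {
  val n = samples.length
  val norm = 1.0 / (n * h * math.sqrt(2 * math.Pi))
  norm * samples.map { xi =>
    val u = (x - xi) / h
    math.exp(-0.5 * u * u)
  }.sum
}

// Example: density of a three-point sample evaluated at 0.0
// kde(Seq(-1.0, 0.0, 1.0), h = 0.5)(0.0)
{code}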
[jira] [Assigned] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7844: --- Assignee: (was: Apache Spark) Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557650#comment-14557650 ] Manoj Kumar commented on SPARK-7844: ping [~josephkb] Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7844: --- Assignee: Apache Spark Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar Assignee: Apache Spark The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557651#comment-14557651 ] Apache Spark commented on SPARK-7844: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6383 Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7845: --- Description: A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 was: A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. Bump Hadoop 1 tests to version 1.2.0 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7845: --- Assignee: Yin Huai Bump Hadoop 1 tests to version 1.2.0 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7845: Summary: Bump Hadoop 1 tests to version 1.2.1 (was: Bump Hadoop 1 tests to version 1.2.0) Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7845: --- Assignee: Yin Huai (was: Apache Spark) Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.1. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557660#comment-14557660 ] Apache Spark commented on SPARK-7845: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6384 Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.1. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7845: --- Assignee: Apache Spark (was: Yin Huai) Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Apache Spark Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.1. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7832) Always run SQL tests in master build.
[ https://issues.apache.org/jira/browse/SPARK-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7832: --- Assignee: Yin Huai (was: Apache Spark) Always run SQL tests in master build. - Key: SPARK-7832 URL: https://issues.apache.org/jira/browse/SPARK-7832 Project: Spark Issue Type: Task Components: Build, SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Our master build does not run Hive compatibility tests. We need to enable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7832) Always run SQL tests in master build.
[ https://issues.apache.org/jira/browse/SPARK-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7832: --- Assignee: Apache Spark (was: Yin Huai) Always run SQL tests in master build. - Key: SPARK-7832 URL: https://issues.apache.org/jira/browse/SPARK-7832 Project: Spark Issue Type: Task Components: Build, SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Critical Our master build does not run Hive compatibility tests. We need to enable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7832) Always run SQL tests in master build.
[ https://issues.apache.org/jira/browse/SPARK-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557661#comment-14557661 ] Apache Spark commented on SPARK-7832: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6385 Always run SQL tests in master build. - Key: SPARK-7832 URL: https://issues.apache.org/jira/browse/SPARK-7832 Project: Spark Issue Type: Task Components: Build, SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Our master build does not run Hive compatibility tests. We need to enable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7780: --- Assignee: (was: Apache Spark) The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7780: --- Assignee: Apache Spark The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai Assignee: Apache Spark The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557673#comment-14557673 ] Apache Spark commented on SPARK-7780: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/6386 The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
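For clarity, the objective in question can be written as the standard L2-regularized logistic loss (notation mine, not from the Spark source), where the penalty applies to the weight vector w but leaves the intercept b unpenalized:
{noformat}
\min_{w,\,b} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i (w^\top x_i + b)}\right) \; + \; \frac{\lambda}{2} \lVert w \rVert_2^2
{noformat}
Only w appears in the penalty term, so b is free to absorb the class prior. With lambda = 0 the penalty vanishes entirely, which is why the two implementations converge to the same solution in that case, as the description notes.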
[jira] [Commented] (SPARK-6907) Create an isolated classloader for the Hive Client.
[ https://issues.apache.org/jira/browse/SPARK-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557675#comment-14557675 ] Patrick Wendell commented on SPARK-6907: Hey [~ste...@apache.org] - my guess is that most packagers will use this by simply pointing to their existing Hive jars via the relevant configs. As Michael said, the Ivy downloading is convenient, but it's not the only mechanism. Create an isolated classloader for the Hive Client. --- Key: SPARK-6907 URL: https://issues.apache.org/jira/browse/SPARK-6907 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
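As a concrete illustration of that configuration path, a sketch (the jar locations are hypothetical, and the property names are the ones introduced alongside the isolated client loader; check the SQL programming guide for the authoritative list):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Point the isolated Hive client at jars already on the machine instead of
// letting it download them via Ivy ("maven" is the download option).
val conf = new SparkConf()
  .setAppName("hive-client-from-local-jars")
  .set("spark.sql.hive.metastore.version", "0.12.0")
  .set("spark.sql.hive.metastore.jars",
    "/opt/hive-0.12.0/lib/*:/opt/hadoop/share/hadoop/common/*")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("SHOW TABLES").show()
{code}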