[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555680#comment-14555680 ] Paul Wu commented on SPARK-7804: Unfortunately, JdbcRDD was poorly designed: its lowerBound and upperBound are long types, which is too limiting. One of my team members implemented a general version based on the same idea, but some of my team are worried about the home-made solution. When we saw JDBCRDD, it looked like what we wanted. In fact, I hope JDBCRDD can be made public, or that JdbcRDD can be redesigned to handle the general case just as JDBCRDD does. > Incorrect results from JDBCRDD -- one record repeatedly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Only one record is returned, repeated throughout the RDD, with repeated field values: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List<JDBCPartition> partitionList = new ArrayList<>(); > partitionList.add(new JDBCPartition("1=1", 0)); > List<StructField> fields = new ArrayList<>(); > fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{}, > partitionList.toArray(new JDBCPartition[0])); > System.out.println("count before to Java RDD=" + jdbcRDD.cache().count()); > JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List<Row> lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > The result is: > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
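For reference on the "long bounds" limitation discussed in the comment above, below is a minimal, hedged sketch of how the public org.apache.spark.rdd.JdbcRDD is typically driven from Java via JdbcRDD.create. The connection URL, query, and bound values are placeholders, and the exact overloads should be verified against the Spark version in use; this is an illustration, not the reporter's code.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

public class PublicJdbcRddSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JdbcRDD sketch"));

    // Placeholder connection details -- adjust for the actual database.
    final String url = "jdbc:oracle:thin:@//dbhost:1521/service";

    // The query must contain two '?' placeholders that Spark fills with each
    // partition's lower and upper bound -- this is the numeric (long) bounds
    // restriction the comment above is referring to.
    JavaRDD<String[]> rows = JdbcRDD.create(
        sc,
        new JdbcRDD.ConnectionFactory() {
          @Override
          public Connection getConnection() throws Exception {
            return DriverManager.getConnection(url);
          }
        },
        "SELECT attuid, name, email FROM users WHERE attuid >= ? AND attuid <= ?",
        1L,    // lowerBound (must be a long)
        100L,  // upperBound (must be a long)
        3,     // numPartitions
        new Function<ResultSet, String[]>() {
          @Override
          public String[] call(ResultSet rs) throws Exception {
            return new String[] {rs.getString(1), rs.getString(2), rs.getString(3)};
          }
        });

    for (String[] r : rows.collect()) {
      System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
    sc.stop();
  }
}
{code}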
[jira] [Commented] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555669#comment-14555669 ] Apache Spark commented on SPARK-7605: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6346 > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
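For context, here is a minimal sketch of the existing Scala/Java org.apache.spark.mllib.feature.ElementwiseProduct API that the Python wrapper in the pull request above is expected to mirror; the scaling values are arbitrary examples, and the eventual Python method names may differ.
{code}
import org.apache.spark.mllib.feature.ElementwiseProduct;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class ElementwiseProductSketch {
  public static void main(String[] args) {
    // Hadamard (element-wise) product: each input component is multiplied
    // by the corresponding component of the scaling vector.
    ElementwiseProduct transformer =
        new ElementwiseProduct(Vectors.dense(0.0, 1.0, 2.0));

    Vector scaled = transformer.transform(Vectors.dense(1.0, 2.0, 3.0));
    System.out.println(scaled);  // expected: [0.0, 2.0, 6.0]
  }
}
{code}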
[jira] [Assigned] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7605: --- Assignee: (was: Apache Spark) > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7605: --- Assignee: Apache Spark > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang >Assignee: Apache Spark > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7221: --- Assignee: (was: Apache Spark) > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Priority: Minor > > This is a wished feature from Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there's no API to get the processed file name for > FileInputDStream, it is useful if we can expose this to the users. > The major problem is how to expose this to the users with an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7221: --- Assignee: Apache Spark > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > This is a wished feature from Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there's no API to get the processed file name for > FileInputDStream, it is useful if we can expose this to the users. > The major problem is how to expose this to the users with an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555668#comment-14555668 ] Apache Spark commented on SPARK-7221: - User 'animeshbaranawal' has created a pull request for this issue: https://github.com/apache/spark/pull/6347 > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Priority: Minor > > This is a feature requested on the Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there is no API to get the name of the file being processed by > FileInputDStream; it would be useful to expose this to users. > The main question is how to expose it to users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7804: -- Flags: (was: Important) Labels: (was: JDBCRDD sql) > Incorrect results from JDBCRDD -- one record repeatly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Getting only one record repeated in the RDD and repeated field value: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List partitionList = new ArrayList<>(); > //int i; > partitionList.add(new JDBCPartition("1=1", 0)); > > List fields = new ArrayList(); > fields.add(DataTypes.createStructField("attuid", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, > true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, > true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{ }, > > partitionList.toArray(new JDBCPartition[0]) > > ); > > System.out.println("count before to Java RDD=" + > jdbcRDD.cache().count()); > JavaRDD jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > result is : > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555647#comment-14555647 ] Josh Rosen commented on SPARK-7804: --- If possible, we should be hiding the internal JDBCRDD (all-caps) from the Javadoc; I've filed SPARK-7821 so that we remember to follow up on this. Slightly confusingly, Spark also has another class called JdbcRDD (note the different capitalization) which _is_ a public API: https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/rdd/JdbcRDD.html. Perhaps you meant to use that instead? There might be a way to address your use-case while continuing to use the public DataFrame APIs, but I don't know enough about your use-case or Spark SQL APIs to provide a great answer. The Spark Users mailing list would probably be a better place to have that discussion, though. In the meantime, I'm going to resolve this JIRA ticket as "Not an Issue." > Incorrect results from JDBCRDD -- one record repeatly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Getting only one record repeated in the RDD and repeated field value: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List partitionList = new ArrayList<>(); > //int i; > partitionList.add(new JDBCPartition("1=1", 0)); > > List fields = new ArrayList(); > fields.add(DataTypes.createStructField("attuid", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, > true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, > true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{ }, > > partitionList.toArray(new JDBCPartition[0]) > > ); > > System.out.println("count before to Java RDD=" + > jdbcRDD.cache().count()); > JavaRDD jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > result is : > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-7804. --- Resolution: Invalid > Incorrect results from JDBCRDD -- one record repeatly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Getting only one record repeated in the RDD and repeated field value: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List partitionList = new ArrayList<>(); > //int i; > partitionList.add(new JDBCPartition("1=1", 0)); > > List fields = new ArrayList(); > fields.add(DataTypes.createStructField("attuid", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, > true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, > true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{ }, > > partitionList.toArray(new JDBCPartition[0]) > > ); > > System.out.println("count before to Java RDD=" + > jdbcRDD.cache().count()); > JavaRDD jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > result is : > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/22/15 6:17 AM: --- Some notes: In PR #6322: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 1. Move Evaluator to ml.evaluation. 1. Mention larger metrics are better. 1. PipelineModel doc. “compiled” -> “fitted” 1. Hide PolynomialExpansion.expand 1. Hide VectorAssembler. 1. Word2Vec.minCount -> @param 1. ParamValidators -> DeveloperApi 1. Hide MetadataUtils/SchemaUtils. Others: 1. @varargs to setDefault (SPARK-7498) 1. Update RegexTokenizer default setting. (SPARK-7794) 1. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 1. Remove Params.validateParams(paramMap)? 1. param and getParam should be final (SPARK-7816) 1. UnresolvedAttribute (Java compatibility?) 1. Missing RegressionEvaluator (SPARK-7404) 1. ml.feature missing package doc (SPARK-7808) 1. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 1. ALSModel -> remove training parameters? was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final (SPARK-7816) 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7821) Hide private SQL JDBC classes from Javadoc
Josh Rosen created SPARK-7821: - Summary: Hide private SQL JDBC classes from Javadoc Key: SPARK-7821 URL: https://issues.apache.org/jira/browse/SPARK-7821 Project: Spark Issue Type: Improvement Components: Documentation, SQL Reporter: Josh Rosen We should hide {{private\[sql\]}} JDBC classes from the generated Javadoc, since showing these internal classes can be confusing to users. This is especially important for the SQL {{jdbc}} package because it contains an internal JDBCRDD class which is easily confused with the public JdbcRDD class in Spark Core (see SPARK-7804 for an example of this). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7712) Native Spark Window Functions & Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555629#comment-14555629 ] Yin Huai commented on SPARK-7712: - By the way, I just removed the fix version. We set that when we resolve the JIRA. > Native Spark Window Functions & Performance Improvements > - > > Key: SPARK-7712 > URL: https://issues.apache.org/jira/browse/SPARK-7712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Herman van Hovell tot Westerflier > Original Estimate: 336h > Remaining Estimate: 336h > > Hi All, > After playing with the current Spark window implementation, I tried to take > this to the next level. My main goals were to address the following issues: > native Spark SQL and performance. > *Native Spark SQL* > The current implementation uses Hive UDAFs as its aggregation mechanism. We > try to address the following issues by moving to a more 'native' Spark SQL > approach: > - Window functions require Hive. Some people (mostly by accident) use Spark > SQL without Hive. Usage of UDAFs is still supported though. > - Adding your own aggregates requires you to write them in Hive instead of > native Spark SQL. > - Hive UDAFs are very well written and quite quick, but they are opaque in > processing and memory management; this makes them hard to optimize. By using > 'native' Spark SQL constructs we can actually do a lot more optimization, for > example AggregateEvaluation-style window processing (this would require us to > move some of the code out of the AggregateEvaluation class into some common > base class), or Tungsten-style memory management. > *Performance* > - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current > implementation in Spark uses a sliding-window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). It also means that all these aggregates need > to be updated separately, which takes N*(N-1)/2 updates. The running case > differs from the sliding case because we are only adding data to an aggregate > function (no reset is required): we only need to maintain one aggregate (as > in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses one buffer and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that the aggregate operations > are commutative; there is one twist, though: it processes the input buffer > in reverse. > - Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > - A single Window node is able to process all types of frames for the same > partitioning/ordering. This saves a little time/memory spent buffering and > managing partitions. > - A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and it improves the readability > of the execution code. > The original work, including some benchmarking code for the running case, can > be found here: https://github.com/hvanhovell/spark-window > A PR has been created; it is still work in progress and can be found here: > https://github.com/apache/spark/pull/6278 > Comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
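To make the running-frame argument in the description above concrete, here is a tiny, self-contained illustration (not Spark code) of the claimed difference: a running frame such as UNBOUNDED PRECEDING AND CURRENT ROW can be computed with a single accumulator in N updates, whereas re-evaluating every frame from scratch costs quadratic work.
{code}
import java.util.Arrays;

public class RunningFrameSketch {
  // O(N): one accumulator, updated once per row, emitting the value after each update.
  static double[] runningSums(double[] partition) {
    double[] out = new double[partition.length];
    double acc = 0.0;
    for (int i = 0; i < partition.length; i++) {
      acc += partition[i];
      out[i] = acc;
    }
    return out;
  }

  // O(N^2): the naive per-row frame evaluation that the new implementation avoids.
  static double[] runningSumsNaive(double[] partition) {
    double[] out = new double[partition.length];
    for (int i = 0; i < partition.length; i++) {
      double sum = 0.0;
      for (int j = 0; j <= i; j++) {
        sum += partition[j];
      }
      out[i] = sum;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] partition = {1.0, 2.0, 3.0, 4.0};
    System.out.println(Arrays.toString(runningSums(partition)));      // [1.0, 3.0, 6.0, 10.0]
    System.out.println(Arrays.toString(runningSumsNaive(partition))); // same values, quadratic work
  }
}
{code}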
[jira] [Resolved] (SPARK-7578) User guide update for spark.ml IDF, Normalizer, StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7578. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6127 [https://github.com/apache/spark/pull/6127] > User guide update for spark.ml IDF, Normalizer, StandardScaler > -- > > Key: SPARK-7578 > URL: https://issues.apache.org/jira/browse/SPARK-7578 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.4.0 > > > Copied from [SPARK-7443]: > {quote} > Now that we have algorithms in spark.ml which are not in spark.mllib, we > should start making subsections for the spark.ml API as needed. We can follow > the structure of the spark.mllib user guide. > * The spark.ml user guide can provide: (a) code examples and (b) info on > algorithms which do not exist in spark.mllib. > * We should not duplicate info in the spark.ml guides. Since spark.mllib is > still the primary API, we should provide links to the corresponding > algorithms in the spark.mllib user guide for more info. > {quote} > Note: I created a new subsection for links to spark.ml-specific guides in > this JIRA's PR: [SPARK-7557]. This transformer can go within the new > subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
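As background for the guide sections being resolved above, here is a minimal sketch of the spark.mllib StandardScaler that the spark.ml documentation links back to for algorithm details; the input vectors are made-up examples and the guide's exact wording is not reproduced here.
{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.StandardScaler;
import org.apache.spark.mllib.feature.StandardScalerModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class StandardScalerSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("StandardScaler sketch"));

    JavaRDD<Vector> data = sc.parallelize(Arrays.asList(
        Vectors.dense(1.0, 10.0),
        Vectors.dense(2.0, 20.0),
        Vectors.dense(3.0, 30.0)));

    // withMean = true, withStd = true: center each feature and scale to unit variance.
    StandardScalerModel model = new StandardScaler(true, true).fit(data.rdd());

    Vector scaled = model.transform(Vectors.dense(2.0, 20.0));
    System.out.println(scaled);  // the mean row maps to the zero vector

    sc.stop();
  }
}
{code}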
[jira] [Commented] (SPARK-7712) Native Spark Window Functions & Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555627#comment-14555627 ] Yin Huai commented on SPARK-7712: - [~hvanhovell] Thank you for the update. These optimizations look great. Since we will have a major refactoring of UDAF interfaces after 1.4 release and window functions are quite related to that, I propose to work on our window function improvement with our UDAF refactoring work or after we have fixed the design of the UDAF interfaces (we can figure out how we are going to proceed once we release 1.4). What do you think? Also, please feel free to post more thoughts at here and we can use this jira to do more design discussion. > Native Spark Window Functions & Performance Improvements > - > > Key: SPARK-7712 > URL: https://issues.apache.org/jira/browse/SPARK-7712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Herman van Hovell tot Westerflier > Original Estimate: 336h > Remaining Estimate: 336h > > Hi All, > After playing with the current spark window implementation, I tried to take > this to next level. My main goal is/was to address the following issues: > Native Spark SQL & Performance. > *Native Spark SQL* > The current implementation uses Hive UDAFs as its aggregation mechanism. We > try to address the following issues by moving to a more 'native' Spark SQL > approach: > - Window functions require Hive. Some people (mostly by accident) use Spark > SQL without Hive. Usage of UDAFs is still supported though. > - Adding your own Aggregates requires you to write them in Hive instead of > native Spark SQL. > - Hive UDAFs are very well written and quite quick, but they are opaque in > processing and memory management; this makes them hard to optimize. By using > 'Native' Spark SQL constructs we can actually do alot more optimization, for > example AggregateEvaluation style Window processing (this would require us to > move some of the code out of the AggregateEvaluation class into some Common > base class), or Tungten style memory management. > *Performance* > - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current > implementation in spark uses a sliding window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). This also means that all these aggregates all need > to be updated separately, this takes N*(N-1)/2 updates. The running case > differs from the Sliding case because we are only adding data to an aggregate > function (no reset is required), we only need to maintain one aggregate (like > in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses 1 buffer, and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that aggregate operations > are communitative, there is one twist though it will process the input buffer > in reverse. > - Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. 
The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > - A single Window node is able to process all types of Frames for the same > Partitioning/Ordering. This saves a little time/memory spent buffering and > managing partitions. > - A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and improves readability > of the execution code. > The original work including some benchmarking code for the running case can > be here: https://github.com/hvanhovell/spark-window > A PR has been created, this is still work in progress, and can be found here: > https://github.com/apache/spark/pull/6278 > Comments, feedback and other discussion is much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions & Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7712: Fix Version/s: (was: 1.5.0) > Native Spark Window Functions & Performance Improvements > - > > Key: SPARK-7712 > URL: https://issues.apache.org/jira/browse/SPARK-7712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Herman van Hovell tot Westerflier > Original Estimate: 336h > Remaining Estimate: 336h > > Hi All, > After playing with the current spark window implementation, I tried to take > this to next level. My main goal is/was to address the following issues: > Native Spark SQL & Performance. > *Native Spark SQL* > The current implementation uses Hive UDAFs as its aggregation mechanism. We > try to address the following issues by moving to a more 'native' Spark SQL > approach: > - Window functions require Hive. Some people (mostly by accident) use Spark > SQL without Hive. Usage of UDAFs is still supported though. > - Adding your own Aggregates requires you to write them in Hive instead of > native Spark SQL. > - Hive UDAFs are very well written and quite quick, but they are opaque in > processing and memory management; this makes them hard to optimize. By using > 'Native' Spark SQL constructs we can actually do alot more optimization, for > example AggregateEvaluation style Window processing (this would require us to > move some of the code out of the AggregateEvaluation class into some Common > base class), or Tungten style memory management. > *Performance* > - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current > implementation in spark uses a sliding window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). This also means that all these aggregates all need > to be updated separately, this takes N*(N-1)/2 updates. The running case > differs from the Sliding case because we are only adding data to an aggregate > function (no reset is required), we only need to maintain one aggregate (like > in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses 1 buffer, and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that aggregate operations > are communitative, there is one twist though it will process the input buffer > in reverse. > - Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > - A single Window node is able to process all types of Frames for the same > Partitioning/Ordering. This saves a little time/memory spent buffering and > managing partitions. > - A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and improves readability > of the execution code. 
> The original work including some benchmarking code for the running case can > be here: https://github.com/hvanhovell/spark-window > A PR has been created, this is still work in progress, and can be found here: > https://github.com/apache/spark/pull/6278 > Comments, feedback and other discussion is much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7819: Priority: Critical (was: Major) > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Fi >Priority: Critical > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7819: Affects Version/s: (was: 1.4.1) 1.4.0 > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Fi >Priority: Critical > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7819: Target Version/s: 1.4.0 > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7820) Java 8 test suite compile error under SBT
[ https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7820: --- Priority: Minor (was: Major) > Java 8 test suite compile error under SBT > - > > Key: SPARK-7820 > URL: https://issues.apache.org/jira/browse/SPARK-7820 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao >Priority: Minor > > Lots of compilation error is shown when java 8 test suite is enabled in SBT: > {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Pjava8-test}} > {code} > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: > error: cannot find symbol > [error] public class Java8APISuite extends LocalJavaStreamingContext > implements Serializable { > [error]^ > [error] symbol: class LocalJavaStreamingContext > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: > error: cannot find symbol > [error] JavaTestUtils.attachTestOutputStream(letterCount); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > {code} > The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}} which > exists in streaming test jar. It is OK for maven compile, since it will > generate test jar, but will be failed in sbt test compile, sbt do not > generate test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555602#comment-14555602 ] Paul Wu commented on SPARK-7804: Thanks -- you are right. The cache() was a problem, and I also cannot use "List<Row> lr = jrdd.collect();". But jrdd.foreach((Row r) -> { System.out.println(r.get(0) + " ." + r.get(1) + " " + r.get(2)); }); or foreachPartition will work. We really wanted to use DataFrame, but it does not have the partitioning options that we need to improve performance. Using this class, we can take advantage of sending a separate query to the database for each partition at the same time. But as you said, this is internal code (I cannot see it in the Javadoc), so I'm not sure what I can do now. I guess you can close this ticket. Thanks again! > Incorrect results from JDBCRDD -- one record repeatedly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > Labels: JDBCRDD, sql > > Only one record is returned, repeated throughout the RDD, with repeated field values: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List<JDBCPartition> partitionList = new ArrayList<>(); > partitionList.add(new JDBCPartition("1=1", 0)); > List<StructField> fields = new ArrayList<>(); > fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{}, > partitionList.toArray(new JDBCPartition[0])); > System.out.println("count before to Java RDD=" + jdbcRDD.cache().count()); > JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List<Row> lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > The result is: > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
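Regarding the wish above for partition options on the DataFrame side: depending on the Spark version in use, the JDBC data source may already accept one WHERE-clause predicate per partition, which is essentially what the JDBCPartition("1=1", 0) trick does. The sketch below assumes Spark 1.4's DataFrameReader.jdbc overload that takes an array of predicates (one partition per predicate); the URL, table, predicates, and the "driver" property are placeholders and should be verified against the actual API before relying on them.
{code}
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class JdbcPredicateSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JDBC predicate sketch"));
    SQLContext sqlContext = new SQLContext(sc);

    Properties props = new Properties();
    props.setProperty("driver", "oracle.jdbc.OracleDriver");  // assumption: driver is picked up from this property

    // One partition per predicate; each partition issues its own query in parallel.
    String[] predicates = {"attuid < 20", "attuid >= 20"};

    DataFrame users = sqlContext.read()
        .jdbc("jdbc:oracle:thin:@//dbhost:1521/service", "USERS", predicates, props);

    for (Row r : users.collectAsList()) {
      System.out.println(r.getString(0) + " " + r.getString(1) + " " + r.getString(2));
    }

    sc.stop();
  }
}
{code}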
[jira] [Updated] (SPARK-7820) Java 8 test suite compile error under SBT
[ https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7820: --- Component/s: Streaming > Java 8 test suite compile error under SBT > - > > Key: SPARK-7820 > URL: https://issues.apache.org/jira/browse/SPARK-7820 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > Lots of compilation error is shown when java 8 test suite is enabled in SBT: > {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Pjava8-test}} > {code} > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: > error: cannot find symbol > [error] public class Java8APISuite extends LocalJavaStreamingContext > implements Serializable { > [error]^ > [error] symbol: class LocalJavaStreamingContext > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: > error: cannot find symbol > [error] JavaTestUtils.attachTestOutputStream(letterCount); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > {code} > The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}} which > exists in streaming test jar. It is OK for maven compile, since it will > generate test jar, but will be failed in sbt test compile, sbt do not > generate test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1487#comment-1487 ] Fi commented on SPARK-7819: --- FYI, I believe I have worked around the problem for now by disabling isolation by hacking: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala -isolationOn = true, + isolationOn = false, I suppose this is good enough for us since we only need the pre-built version of hive (as provided by the mapr4 profile). > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
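One alternative to patching isolationOn that may be worth checking: recent 1.4 builds appear to expose a spark.sql.hive.metastore.sharedPrefixes setting that makes the isolated Hive client loader share selected classes with the main class loader. If that option is present in this build (an assumption), listing the MapR packages there might avoid the native-library double-load without a source change; the property name and value below are unverified guesses for this environment.
{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SharedPrefixWorkaroundSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("hive isolation workaround sketch")
        // Assumption: this option exists in the 1.4 build being tested, and "com.mapr"
        // covers the classes that load libMapRClient; both points need verification.
        .set("spark.sql.hive.metastore.sharedPrefixes", "com.mapr,org.apache.hadoop");

    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... create a HiveContext and run queries as in the attached test.py ...
    sc.stop();
  }
}
{code}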
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-7819: -- Attachment: stacktrace.txt > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7820) Java 8 test suite compile error under SBT
Saisai Shao created SPARK-7820: -- Summary: Java 8 test suite compile error under SBT Key: SPARK-7820 URL: https://issues.apache.org/jira/browse/SPARK-7820 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Saisai Shao Lots of compilation error is shown when java 8 test suite is enabled in SBT: {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pjava8-test}} {code} [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: error: cannot find symbol [error] public class Java8APISuite extends LocalJavaStreamingContext implements Serializable { [error]^ [error] symbol: class LocalJavaStreamingContext [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: error: cannot find symbol [error] JavaTestUtils.attachTestOutputStream(letterCount); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite {code} The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}} which exists in streaming test jar. It is OK for maven compile, since it will generate test jar, but will be failed in sbt test compile, sbt do not generate test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-7819: -- Attachment: (was: stacktrace.txt) > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-7819: -- Attachment: stacktrace.txt test.py > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
Fi created SPARK-7819: - Summary: Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error Key: SPARK-7819 URL: https://issues.apache.org/jira/browse/SPARK-7819 Project: Spark Issue Type: Bug Affects Versions: 1.4.1 Reporter: Fi Attachments: stacktrace.txt, test.py In reference to the pull request: https://github.com/apache/spark/pull/5876 I have been running the Spark 1.3 branch for some time with no major hiccups, and recently switched to the Spark 1.4 branch. I build my Spark distribution with the following build command: make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver When running a Python script containing a series of smoke tests I use to validate the build, I encountered an error under the following conditions: * start a spark context * start a hive context * run any hive query * stop the spark context * start a second spark context * run any hive query *** ERROR From what I can tell, the Isolated Class Loader is hitting a MapR class that is loading its native library (presumably as part of a static initializer). Unfortunately, the JVM prohibits this the second time around. I would think that shutting down the SparkContext would clear out any such vestiges in the JVM, so I'm surprised that this would even be a problem. Note: all other smoke tests we are running pass fine. I will attach the stacktrace and a Python script reproducing the issue (at least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
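For readers unfamiliar with this error: the JVM allows a given native library to be linked by only one classloader at a time, so when the isolated Hive client loader loads the MapR class again in the second context, the second {{System.loadLibrary}} call fails. A minimal sketch of the pattern involved; the library name is illustrative and running this requires such a library on {{java.library.path}}:
{code}
// Illustrative only: a class whose initializer links a native library.
// Loading a class like this through two isolated classloaders in the same JVM
// fails the second time with "Native Library ... already loaded in another
// classloader", matching the attached stacktrace.
object NativeBridge {
  System.loadLibrary("MapRClient") // hypothetical library name
}
{code}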
[jira] [Created] (SPARK-7818) Java 8 test suite compile error under SBT
Saisai Shao created SPARK-7818: -- Summary: Java 8 test suite compile error under SBT Key: SPARK-7818 URL: https://issues.apache.org/jira/browse/SPARK-7818 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Saisai Shao Many compilation errors are shown when the Java 8 test suite is enabled in SBT: {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pjava8-test}} {code} [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: error: cannot find symbol [error] public class Java8APISuite extends LocalJavaStreamingContext implements Serializable { [error]^ [error] symbol: class LocalJavaStreamingContext [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: error: cannot find symbol [error] JavaTestUtils.attachTestOutputStream(letterCount); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite {code} The class {{Java8APISuite}} relies on {{LocalJavaStreamingContext}}, which lives in the streaming test jar. The Maven build is fine because it generates that test jar, but the sbt test compile fails because sbt does not generate test jars by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1459#comment-1459 ] Manoj Kumar commented on SPARK-7536: Should all of this be done before the 1.4 release? > Audit MLlib Python API for 1.4 > -- > > Key: SPARK-7536 > URL: https://issues.apache.org/jira/browse/SPARK-7536 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? SPARK-7667 > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. SPARK-7666 > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. SPARK-7665 > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python. > ** classification > *** StreamingLogisticRegressionWithSGD SPARK-7633 > ** clustering > *** GaussianMixture SPARK-6258 > *** LDA SPARK-6259 > *** Power Iteration Clustering SPARK-5962 > *** StreamingKMeans SPARK-4118 > ** evaluation > *** MultilabelMetrics SPARK-6094 > ** feature > *** ElementwiseProduct SPARK-7605 > *** PCA SPARK-7604 > ** linalg > *** Distributed linear algebra SPARK-6100 > ** pmml.export SPARK-7638 > ** regression > *** StreamingLinearRegressionWithSGD SPARK-4127 > ** stat > *** KernelDensity SPARK-7639 > ** util > *** MLUtils SPARK-6263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7404: --- Assignee: Apache Spark (was: Ram Sriharsha) > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7404: --- Assignee: Ram Sriharsha (was: Apache Spark) > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1422#comment-1422 ] Apache Spark commented on SPARK-7404: - User 'harsha2010' has created a pull request for this issue: https://github.com/apache/spark/pull/6344 > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
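To make the one-line description concrete, here is a hedged sketch of how a regression evaluator plugs into model tuning via {{CrossValidator}}, written against the 1.4 spark.ml API and assuming a DataFrame {{training}} with "label" and "features" columns; it is not taken from the pull request itself:
{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Tune a regression model with the pipeline API; the evaluator scores each
// parameter combination (e.g. by RMSE) during cross-validation.
val lr = new LinearRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
val model = cv.fit(training) // `training` is an assumed DataFrame
{code}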
[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()
[ https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555482#comment-14555482 ] Nicholas Chammas commented on SPARK-7507: - Since {{Row}} seems most analogous to a {{namedtuple}} in Python, here is an interesting parallel that suggests we should perhaps instead support {{vars(Row)}} and not {{dict(Row)}}. http://stackoverflow.com/q/26180528/877069 https://docs.python.org/3/library/functions.html#vars https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict {quote} somenamedtuple._asdict() Return a new OrderedDict which maps field names to their corresponding values. Note, this method is no longer needed now that the same effect can be achieved by using the built-in vars() function: {quote} > pyspark.sql.types.StructType and Row should implement __iter__() > > > Key: SPARK-7507 > URL: https://issues.apache.org/jira/browse/SPARK-7507 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > {{StructType}} looks an awful lot like a Python dictionary. > However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion > like this doesn't work: > {code} > >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}'])) > >>> df.schema > StructType(List(StructField(name,StringType,true))) > >>> dict(df.schema) > Traceback (most recent call last): > File "", line 1, in > TypeError: 'StructType' object is not iterable > {code} > This would be super helpful for doing any custom schema manipulations without > having to go through the whole {{.json() -> json.loads() -> manipulate() -> > json.dumps() -> .fromJson()}} charade. > Same goes for {{Row}}, which offers an > [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict] > method but doesn't support the more Pythonic {{dict(Row)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555475#comment-14555475 ] Manoj Kumar commented on SPARK-7605: Hi, Can this be assigned to me? > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555462#comment-14555462 ] Apache Spark commented on SPARK-7322: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6343 > Add DataFrame DSL for window function support > - > > Key: SPARK-7322 > URL: https://issues.apache.org/jira/browse/SPARK-7322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao > Labels: DataFrame > > Here's a proposal for supporting window functions in the DataFrame DSL: > 1. Add an over function to Column: > {code} > class Column { > ... > def over(window: Window): Column > ... > } > {code} > 2. Window: > {code} > object Window { > def partitionBy(...): Window > def orderBy(...): Window > object Frame { > def unbounded: Frame > def preceding(n: Long): Frame > def following(n: Long): Frame > } > class Frame > } > class Window { > def orderBy(...): Window > def rowsBetween(Frame, Frame): Window > def rangeBetween(Frame, Frame): Window // maybe add this later > } > {code} > Here's an example to use it: > {code} > df.select( > avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) > .rowsBetween(Frame.unbounded, Frame.currentRow)) > ) > df.select( > avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) > .rowsBetween(Frame.preceding(50), Frame.following(10))) > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7817) Intellij Idea cannot find symbol when import scala object
bofei.xiao created SPARK-7817: - Summary: Intellij Idea cannot find symbol when import scala object Key: SPARK-7817 URL: https://issues.apache.org/jira/browse/SPARK-7817 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.3.1 Environment: Microsoft Server 2003, Java 1.6, Maven 3.04 Reporter: bofei.xiao [ERROR] src\main\java\org\apache\spark\exaples\streaming\JavaQueueStream.java:[33,47] cannot find symbol symbol : class StreamingExamples location: package org.apache.spark.exaples.streaming In fact, StreamingExamples is an object under org.apache.spark.exaples.streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7785: --- Priority: Minor (was: Major) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Add __str__ and __repr__ to matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7785: --- Description: Add __str__ and __repr__ to matrices. (was: For DenseMatrices. Class Methods __str__, transpose Object Methods zeros, ones, eye, rand, randn, diag For SparseMatrices Class Methods __str__, transpose Object Methods, fromCoo, speye, sprand, sprandn, spdiag, Matrices Methods, horzcat, vertcat) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > Add __str__ and __repr__ to matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555440#comment-14555440 ] Burak Yavuz commented on SPARK-7785: For operations with BlockMatrix, you will need these classes. > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555438#comment-14555438 ] Manoj Kumar commented on SPARK-7785: Sounds great. In the Pull Request, I just added support for __str__ and __repr__ . But was there any particular need to have these classes in the first place, since almost all of them are wrappers around numpy and scipy? > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555436#comment-14555436 ] Apache Spark commented on SPARK-7785: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6342 > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7785: --- Assignee: (was: Apache Spark) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7785: --- Assignee: Apache Spark > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Assignee: Apache Spark > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555435#comment-14555435 ] meiyoula edited comment on SPARK-7789 at 5/22/15 1:51 AM: -- I also used hive 0.13 and Kerberos. [~deanchen]Has you executed the select statement. Below is my test sql statement. {quote} create table s1 ( key1 string, c11 int, c12 string, c13 string, c14 string ) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties( "hbase.columns.mapping" = ":key, info:c11, info:c12, info:c13, info:c14 ") tblproperties("hbase.table.name" = "shb1"); select * from s1; {quote} After reading the hive and hbase code, I think the root cause is that: When the driver obtained the hbase token and add it into Credentials of CurrentUser, the hbase token will also go to executors. So the authentication of user(in executor) is TOKEN to hbase.But the hive code will send request to hbase sever to obtain token no matter what the authentication is. And the hbase code just allow the Kerberos authenticated clients to obtain token. So the exception occurs. So I think the HIVE-8874 is meaningful, it should be merged. was (Author: meiyoula): I also used hive 0.13 and Kerberos. [~deanchen]Has you executed the select statement. Below is my test sql statement. After reading the hive and hbase code, I think the root cause is that: When the driver obtained the hbase token and add it into Credentials of CurrentUser, the hbase token will also go to executors. So the authentication of user(in executor) is TOKEN to hbase.But the hive code will send request to hbase sever to obtain token no matter what the authentication is. And the hbase code just allow the Kerberos authenticated clients to obtain token. So the exception occurs. {quote} create table s1 ( key1 string, c11 int, c12 string, c13 string, c14 string ) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties( "hbase.columns.mapping" = ":key, info:c11, info:c12, info:c13, info:c14 ") tblproperties("hbase.table.name" = "shb1"); select * from s1; {quote} So I think the HIVE-8874 is meaningful, it should be merged. 
> sql on security hbase:Token generation only allowed for Kerberos > authenticated clients > --- > > Key: SPARK-7789 > URL: https://issues.apache.org/jira/browse/SPARK-7789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > After creating a hbase table in beeline, then execute select sql statement, > Executor occurs the exception: > {quote} > java.lang.IllegalStateException: Error while configuring input job properties > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark
[jira] [Updated] (SPARK-7680) Add a fake Receiver that generates random strings, useful for prototyping
[ https://issues.apache.org/jira/browse/SPARK-7680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7680: - Target Version/s: (was: 1.4.0) > Add a fake Receiver that generates random strings, useful for prototyping > - > > Key: SPARK-7680 > URL: https://issues.apache.org/jira/browse/SPARK-7680 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555435#comment-14555435 ] meiyoula commented on SPARK-7789: - I also used hive 0.13 and Kerberos. [~deanchen]Has you executed the select statement. Below is my test sql statement. After reading the hive and hbase code, I think the root cause is that: When the driver obtained the hbase token and add it into Credentials of CurrentUser, the hbase token will also go to executors. So the authentication of user(in executor) is TOKEN to hbase.But the hive code will send request to hbase sever to obtain token no matter what the authentication is. And the hbase code just allow the Kerberos authenticated clients to obtain token. So the exception occurs. {quote} create table s1 ( key1 string, c11 int, c12 string, c13 string, c14 string ) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties( "hbase.columns.mapping" = ":key, info:c11, info:c12, info:c13, info:c14 ") tblproperties("hbase.table.name" = "shb1"); select * from s1; {quote} So I think the HIVE-8874 is meaningful, it should be merged. > sql on security hbase:Token generation only allowed for Kerberos > authenticated clients > --- > > Key: SPARK-7789 > URL: https://issues.apache.org/jira/browse/SPARK-7789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > After creating a hbase table in beeline, then execute select sql statement, > Executor occurs the exception: > {quote} > java.lang.IllegalStateException: Error while configuring input job properties > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: > org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only > allowed for Kerberos authenticated clients > at > org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) > at > org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) > at > org.apache.hadoop.hbase.protobuf.generated.Authenticati
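The reasoning in the comment above rests on the authentication method HBase sees: the driver is Kerberos-authenticated, but executors only carry the shipped delegation token, and HBase's TokenProvider refuses to mint tokens for non-Kerberos clients. A small sketch (standard Hadoop security API, not the Hive/HBase code path itself) of how that state can be inspected:
{code}
import org.apache.hadoop.security.UserGroupInformation

// On the driver this typically prints KERBEROS; on an executor the shipped
// credentials make it TOKEN, which HBase's TokenProvider rejects when asked
// to generate a new delegation token, producing the AccessDeniedException.
val ugi = UserGroupInformation.getCurrentUser
println(s"user=${ugi.getShortUserName} auth=${ugi.getAuthenticationMethod}")
println(s"has Kerberos credentials: ${ugi.hasKerberosCredentials}")
{code}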
[jira] [Updated] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7785: --- Summary: Add pretty printing to pyspark.mllib.linalg.Matrices (was: Add missing items to pyspark.mllib.linalg.Matrices) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555423#comment-14555423 ] Apache Spark commented on SPARK-7042: - User 'kostya-sh' has created a pull request for this issue: https://github.com/apache/spark/pull/6341 > Spark version of akka-actor_2.11 is not compatible with the official > akka-actor_2.11 2.3.x > -- > > Key: SPARK-7042 > URL: https://issues.apache.org/jira/browse/SPARK-7042 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Konstantin Shaposhnikov >Priority: Minor > > When connecting to a remote Spark cluster (that runs Spark branch-1.3 built > with Scala 2.11) from an application that uses akka 2.3.9 I get the following > error: > {noformat} > 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] > [sparkDriver-akka.actor.default-dispatcher-5] - > Association with remote system [akka.tcp://sparkExecutor@server:59007] has > failed, address is now gated for [5000] ms. > Reason is: [akka.actor.Identify; local class incompatible: stream classdesc > serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. > {noformat} > It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been > built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations > (see https://issues.scala-lang.org/browse/SI-8549). > The following steps can resolve the issue: > - re-build the custom akka library that is used by Spark with the more recent > version of Scala compiler (e.g. 2.11.6) > - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo > - update version of akka used by spark (master and 1.3 branch) > I would also suggest to upgrade to the latest version of akka 2.3.9 (or > 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
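For context, the quoted description refers to SI-8549: the Scala 2.11.0 compiler ignored {{@SerialVersionUID}}, so annotated classes (such as akka.actor.Identify) ended up with generated UIDs. The annotation itself is plain Scala; a trivial example with an arbitrary class name:
{code}
// With a compiler that honors the annotation (e.g. Scala 2.11.6), instances of
// this class serialize with serialVersionUID = 1L. Scala 2.11.0 dropped the
// annotation (SI-8549), yielding a generated UID and "local class incompatible"
// errors against binaries built by a fixed compiler.
@SerialVersionUID(1L)
case class Handshake(messageId: Any) extends Serializable
{code}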
[jira] [Assigned] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7446: --- Assignee: holdenk (was: Apache Spark) > Inverse transform for StringIndexer > --- > > Key: SPARK-7446 > URL: https://issues.apache.org/jira/browse/SPARK-7446 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: holdenk >Priority: Minor > > It is useful to convert the encoded indices back to their string > representation for result inspection. We can add a parameter to > StringIndexer/StringIndexModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555420#comment-14555420 ] Apache Spark commented on SPARK-7446: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/6339 > Inverse transform for StringIndexer > --- > > Key: SPARK-7446 > URL: https://issues.apache.org/jira/browse/SPARK-7446 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: holdenk >Priority: Minor > > It is useful to convert the encoded indices back to their string > representation for result inspection. We can add a parameter to > StringIndexer/StringIndexModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
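A hedged sketch of the idea behind the requested inverse transform: map indices back through the label array learned when the indexer was fitted. It uses plain DataFrame operations with assumed inputs ({{labels}} and an {{indexed}} DataFrame holding a Double column "categoryIndex"); it is not the API added by the linked pull request.
{code}
import org.apache.spark.sql.functions.{col, udf}

// Inverse of string indexing: index i maps back to labels(i).
val labels: Array[String] = Array("a", "b", "c")           // assumed fitted labels
val indexToString = udf { (index: Double) => labels(index.toInt) }
val restored = indexed.withColumn("originalCategory", indexToString(col("categoryIndex")))
{code}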
[jira] [Assigned] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7446: --- Assignee: Apache Spark (was: holdenk) > Inverse transform for StringIndexer > --- > > Key: SPARK-7446 > URL: https://issues.apache.org/jira/browse/SPARK-7446 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Minor > > It is useful to convert the encoded indices back to their string > representation for result inspection. We can add a parameter to > StringIndexer/StringIndexModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-7657: Assignee: Hari Shreedharan (was: Imran Rashid) > [YARN] Show driver link in Spark UI > --- > > Key: SPARK-7657 > URL: https://issues.apache.org/jira/browse/SPARK-7657 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.0 >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan >Priority: Minor > Fix For: 1.5.0 > > > Currently, the driver link does not show up in the application UI. It is > painful to debug apps running in cluster mode if the link does not show up. > Client mode is fine since the links are local to the client machine. > In YARN mode, it is possible to just get this from the YARN container report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-7657: --- Assignee: Imran Rashid > [YARN] Show driver link in Spark UI > --- > > Key: SPARK-7657 > URL: https://issues.apache.org/jira/browse/SPARK-7657 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.0 >Reporter: Hari Shreedharan >Assignee: Imran Rashid >Priority: Minor > Fix For: 1.5.0 > > > Currently, the driver link does not show up in the application UI. It is > painful to debug apps running in cluster mode if the link does not show up. > Client mode is fine since the links are local to the client machine. > In YARN mode, it is possible to just get this from the YARN container report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-7657. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6166 [https://github.com/apache/spark/pull/6166] > [YARN] Show driver link in Spark UI > --- > > Key: SPARK-7657 > URL: https://issues.apache.org/jira/browse/SPARK-7657 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.0 >Reporter: Hari Shreedharan >Priority: Minor > Fix For: 1.5.0 > > > Currently, the driver link does not show up in the application UI. It is > painful to debug apps running in cluster mode if the link does not show up. > Client mode is fine since the links are local to the client machine. > In YARN mode, it is possible to just get this from the YARN container report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
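The last sentence of the description ("get this from the YARN container report") can be illustrated with the YARN client API; a hedged sketch assuming the driver's {{ContainerId}} is already known, and not the code merged in the pull request:
{code}
import org.apache.hadoop.yarn.api.records.ContainerId
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Look up a container's assigned node and log URL from the ResourceManager.
def driverLinks(containerId: ContainerId): (String, String) = {
  val yarnClient = YarnClient.createYarnClient()
  yarnClient.init(new YarnConfiguration())
  yarnClient.start()
  try {
    val report = yarnClient.getContainerReport(containerId)
    (report.getAssignedNode.toString, report.getLogUrl)
  } finally {
    yarnClient.stop()
  }
}
{code}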
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/22/15 1:21 AM: --- Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final (SPARK-7816) 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7816) Mark params, getters, and user-facing classes final
Xiangrui Meng created SPARK-7816: Summary: Mark params, getters, and user-facing classes final Key: SPARK-7816 URL: https://issues.apache.org/jira/browse/SPARK-7816 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is to tighten spark.ml APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7815) Move UTF8String into Unsafe java package, and have it work against memory address directly
Reynold Xin created SPARK-7815: -- Summary: Move UTF8String into Unsafe java package, and have it work against memory address directly Key: SPARK-7815 URL: https://issues.apache.org/jira/browse/SPARK-7815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu So we can avoid an extra copy of the data into a byte array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
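The description is terse; the point of working "against memory address directly" is to read bytes at an off-heap address instead of first materializing a byte array. A minimal, illustrative use of {{sun.misc.Unsafe}} showing that kind of access (not the UTF8String design itself):
{code}
// Write and read bytes at a raw memory address -- the kind of access an
// address-based UTF8String comparison could build on, with no byte[] copy.
val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

val bytes = "spark".getBytes("UTF-8")
val addr = unsafe.allocateMemory(bytes.length)
try {
  bytes.zipWithIndex.foreach { case (b, i) => unsafe.putByte(addr + i, b) }
  println(s"first byte at address $addr = ${unsafe.getByte(addr).toChar}")
} finally {
  unsafe.freeMemory(addr)
}
{code}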
[jira] [Created] (SPARK-7814) Turn code generation on by default
Reynold Xin created SPARK-7814: -- Summary: Turn code generation on by default Key: SPARK-7814 URL: https://issues.apache.org/jira/browse/SPARK-7814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
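For reference, code generation in 1.4 is opt-in behind a SQL conf; this issue is about flipping that default. A small example of how it is enabled today, assuming a {{sqlContext}} is in scope (as in spark-shell):
{code}
// As of Spark 1.4, expression code generation is controlled by this conf
// (default false); SPARK-7814 proposes enabling it by default.
sqlContext.setConf("spark.sql.codegen", "true")
{code}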
[jira] [Updated] (SPARK-7813) Push code generation into expression definition
[ https://issues.apache.org/jira/browse/SPARK-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7813: --- Labels: codegen (was: ) > Push code generation into expression definition > --- > > Key: SPARK-7813 > URL: https://issues.apache.org/jira/browse/SPARK-7813 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Critical > Labels: codegen > > Right now we define all expression code generation in a single file. If we > want to do code generation for most default expressions, it'd only make sense > to push them into the expression definitions themselves (similar to "eval" > method). > We would need to design an updated version of the expression API for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7813) Push code generation into expression definition
[ https://issues.apache.org/jira/browse/SPARK-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7813: --- Summary: Push code generation into expression definition (was: Push code generation into expression definition themselves) > Push code generation into expression definition > --- > > Key: SPARK-7813 > URL: https://issues.apache.org/jira/browse/SPARK-7813 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Critical > > Right now we define all expression code generation in a single file. If we > want to do code generation for most default expressions, it'd only make sense > to push them into the expression definitions themselves (similar to "eval" > method). > We would need to design an updated version of the expression API for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7813) Push code generation into expression definition themselves
Reynold Xin created SPARK-7813: -- Summary: Push code generation into expression definition themselves Key: SPARK-7813 URL: https://issues.apache.org/jira/browse/SPARK-7813 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Critical Right now we define all expression code generation in a single file. If we want to do code generation for most default expressions, it'd only make sense to push them into the expression definitions themselves (similar to "eval" method). We would need to design an updated version of the expression API for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
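A self-contained toy sketch of the design direction described above: each expression defines its own code generation next to its interpreted {{eval}}, rather than all generation living in one central file. The names and shapes here are invented for illustration and are not Catalyst's actual API:
{code}
// Toy model: every expression knows how to evaluate itself and how to emit
// equivalent source code for a compiled evaluation path.
sealed trait Expr {
  def eval(row: Map[String, Int]): Int   // interpreted path
  def genCode(rowVar: String): String    // generated-source path
}

case class Attr(name: String) extends Expr {
  def eval(row: Map[String, Int]): Int = row(name)
  def genCode(rowVar: String): String = s"""$rowVar("$name")"""
}

case class Lit(value: Int) extends Expr {
  def eval(row: Map[String, Int]): Int = value
  def genCode(rowVar: String): String = value.toString
}

case class Add(left: Expr, right: Expr) extends Expr {
  def eval(row: Map[String, Int]): Int = left.eval(row) + right.eval(row)
  def genCode(rowVar: String): String =
    s"(${left.genCode(rowVar)} + ${right.genCode(rowVar)})"
}

val expr = Add(Add(Attr("a"), Attr("b")), Lit(1))
println(expr.eval(Map("a" -> 2, "b" -> 3)))  // 6
println(expr.genCode("row"))                 // ((row("a") + row("b")) + 1)
{code}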
[jira] [Resolved] (SPARK-7219) HashingTF should output ML attributes
[ https://issues.apache.org/jira/browse/SPARK-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7219. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6308 [https://github.com/apache/spark/pull/6308] > HashingTF should output ML attributes > - > > Key: SPARK-7219 > URL: https://issues.apache.org/jira/browse/SPARK-7219 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Trivial > Fix For: 1.4.0 > > > HashingTF knows the output feature dimension, which should be in the output > ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7812) Speed up SQL code generation
Reynold Xin created SPARK-7812: -- Summary: Speed up SQL code generation Key: SPARK-7812 URL: https://issues.apache.org/jira/browse/SPARK-7812 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Critical Explore other frameworks to speed up code generation for SQL expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7794) Update RegexTokenizer default settings.
[ https://issues.apache.org/jira/browse/SPARK-7794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7794. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6330 [https://github.com/apache/spark/pull/6330] > Update RegexTokenizer default settings. > --- > > Key: SPARK-7794 > URL: https://issues.apache.org/jira/browse/SPARK-7794 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.4.0 > > > Should use a simple default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7811) Fix typo on slf4j configuration on metrics.properties.template
Judy Nash created SPARK-7811: Summary: Fix typo on slf4j configuration on metrics.properties.template Key: SPARK-7811 URL: https://issues.apache.org/jira/browse/SPARK-7811 Project: Spark Issue Type: Bug Reporter: Judy Nash Priority: Minor There is a minor typo in the Slf4jSink configuration in metrics.properties.template: slf4j is misspelled as sl4j in two of the configuration entries. Correcting the typo ensures users' custom settings will be loaded correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7776) Add shutdown hook to stop StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7776. -- Resolution: Fixed Fix Version/s: 1.4.0 > Add shutdown hook to stop StreamingContext > -- > > Key: SPARK-7776 > URL: https://issues.apache.org/jira/browse/SPARK-7776 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > Fix For: 1.4.0 > > > Shutdown hook to stop SparkContext was added recently. This results in ugly > errors when a streaming application is terminated by ctrl-C. > {code} > Exception in thread "Thread-27" org.apache.spark.SparkException: Job > cancelled because SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1468) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1403) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1642) > at > org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559) > at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2266) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2218) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} > This is because the Spark's shutdown hook stops the context, and the > streaming jobs fail in the middle. The correct solution is to stop the > streaming context before the spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
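A hedged, application-level sketch of the ordering the fix establishes (stop the StreamingContext, and with it the SparkContext, before the JVM's other shutdown hooks run); the merged change registers a prioritized hook inside Spark rather than relying on user code like this:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("shutdown-order-sketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Stop streaming first so batches are not killed mid-flight by SparkContext's
// own shutdown hook when the application receives ctrl-C.
sys.addShutdownHook {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
{code}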
[jira] [Resolved] (SPARK-7783) Add rollup and cube support to DataFrame Python DSL
[ https://issues.apache.org/jira/browse/SPARK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7783. Resolution: Fixed Fix Version/s: 1.4.0 > Add rollup and cube support to DataFrame Python DSL > --- > > Key: SPARK-7783 > URL: https://issues.apache.org/jira/browse/SPARK-7783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ai He updated SPARK-7810: - External issue URL: (was: https://github.com/apache/spark/pull/6338) > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ai He updated SPARK-7810: - External issue URL: https://github.com/apache/spark/pull/6338 > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7810: --- Assignee: (was: Apache Spark) > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555361#comment-14555361 ] Apache Spark commented on SPARK-7810: - User 'AiHe' has created a pull request for this issue: https://github.com/apache/spark/pull/6338 > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7810: --- Assignee: Apache Spark > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He >Assignee: Apache Spark > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
Ai He created SPARK-7810: Summary: rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used Key: SPARK-7810 URL: https://issues.apache.org/jira/browse/SPARK-7810 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 is used. The current method only works with ipv4; the new modification should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6387) HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6387. - Resolution: Won't Fix We aren't building with Hive 12 anymore. > HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0 > --- > > Key: SPARK-6387 > URL: https://issues.apache.org/jira/browse/SPARK-6387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.1, 1.3.0 >Reporter: Cheng Lian > > Reproduction steps: > # Compile Spark against Hive 0.12.0 > {noformat}$ ./build/sbt > -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 > -Dhadoop.version=2.4.1 clean assembly/assembly{noformat} > # Start the Thrift server in HTTP mode > Add the following stanza in {{hive-site.xml}}: > {noformat} > hive.server2.transport.mode > http > {noformat} > and > {noformat}$ ./bin/start-thriftserver.sh{noformat} > # Connect to the Thrift server via Beeline > {noformat}$ ./bin/beeline -u > "jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"{noformat} > # Execute any query and check the server log > We can see that no query execution related logs are output. > The reason is that, when running under HTTP mode, although we pass in a > {{SparkSQLCLIService}} instance > ([here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L102]) > to {{ThriftHttpCLIService}}, Hive 0.12.0 just ignores it, and instantiate a > new {{CLIService}} > ([here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/ThriftHttpCLIService.java#L91-L92] > and > [here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/EmbeddedThriftBinaryCLIService.java#L32]). > Notice that while compiling against Hive 0.13.1, Spark SQL doesn't suffer > from this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7684) TestHive.reset complains Database does not exist: default
[ https://issues.apache.org/jira/browse/SPARK-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7684: Assignee: Cheng Lian > TestHive.reset complains Database does not exist: default > - > > Key: SPARK-7684 > URL: https://issues.apache.org/jira/browse/SPARK-7684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Cheng Lian > > To see the error, try {{test-only > org.apache.spark.sql.hive.MetastoreDataSourcesSuite}}. You will see > {code} > 19:23:30.487 ERROR org.apache.spark.sql.hive.test.TestHive: FATAL ERROR: > Failed to reset TestDB state. > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database > does not exist: default > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139) > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310) > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:425) > at > org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:94) > at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:433) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.afterEach(MetastoreDataSourcesSuite.scala:43) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.afterEach(MetastoreDataSourcesSuite.scala:40) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.afterEach(MetastoreDataSourcesSuite.scala:40) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:40) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at org.scalatest.FunSuite.run(FunSuite.scala:1555) > at > 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) > at sbt.ForkMain$Run$2.call(ForkMain.java:294) > at sbt.ForkMain$Run$2.call(ForkMain.java:284) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2178) createSchemaRDD is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2178. - Resolution: Later This has been a problem since Spark SQL 1.0 and we haven't heard a lot of complaints. Furthermore, Scala 2.10 (which is the only version that should have the problem) isn't making new releases anymore. Macros would be nice, but they aren't pressing enough to keep this open for now. > createSchemaRDD is not thread safe > -- > > Key: SPARK-2178 > URL: https://issues.apache.org/jira/browse/SPARK-2178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > > This is because implicit type tags are not thread safe. We could fix this > with compile time macros (which could also make the conversion a lot faster). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3184) Allow user to specify num tasks to use for a table
[ https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3184. - Resolution: Won't Fix > Allow user to specify num tasks to use for a table > -- > > Key: SPARK-3184 > URL: https://issues.apache.org/jira/browse/SPARK-3184 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andy Konwinski >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5494. - Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Michael Armbrust > SparkSqlSerializer Ignores KryoRegistrators > --- > > Key: SPARK-5494 > URL: https://issues.apache.org/jira/browse/SPARK-5494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Hamel Ajay Kothari >Assignee: Michael Armbrust > Fix For: 1.4.0 > > > We should make SparkSqlSerializer call {{super.newKryo}} before doing any of > its custom setup, in order to make sure it picks up custom > KryoRegistrators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
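To illustrate the pattern behind the SPARK-5494 fix, here is a hypothetical serializer subclass (not the actual SparkSqlSerializer code): building the Kryo instance via super.newKryo() first applies the user-configured spark.kryo.registrator classes, and only then are any extension-specific registrations layered on top.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hypothetical example of the pattern: delegate to super.newKryo() so that
// custom KryoRegistrators supplied by the user are honored, then add this
// serializer's own registrations on top of the returned instance.
class CustomKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()          // picks up spark.kryo.registrator classes
    kryo.register(classOf[Array[Byte]]) // example of an extra, serializer-specific registration
    kryo
  }
}
{code}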
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555326#comment-14555326 ] Ram Sriharsha commented on SPARK-7404: -- Ah, perfect. I didn't notice RegressionMetrics in the codebase. That is great! > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
Xiangrui Meng created SPARK-7809: Summary: MultivariateOnlineSummarizer should allow users to configure what to compute Key: SPARK-7809 URL: https://issues.apache.org/jira/browse/SPARK-7809 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Currently MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute. {code} val summarizer = new MultivariateOnlineSummarizer() .withMean(false) .withMax(false) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555313#comment-14555313 ] Burak Yavuz commented on SPARK-7785: My belief about the Python linalg API so far has been that in Python you have two beautiful libraries, numpy and scipy. The serialization-deserialization overhead is not worth implementing wrappers, because numpy and scipy are backed by C and support vectorization. For linalg we have been leveraging Breeze for a long time, only adding things when the performance can be greatly improved. If we can obtain better performance from numpy and scipy, then let's just leverage them. Most of these methods were already named very similarly to their numpy and scipy counterparts anyway. > Add missing items to pyspark.mllib.linalg.Matrices > -- > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7807) High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource()
[ https://issues.apache.org/jira/browse/SPARK-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555312#comment-14555312 ] Norman He commented on SPARK-7807: -- When running Spark in local mode or on a Mesos cluster, the hadoopConfiguration needs to load core-site.xml and hdfs-site.xml from some http://url service. Because of a bundling issue (there are many copies of core-site.xml and hdfs-site.xml in all kinds of testing jars), the hadoopConfiguration that Spark instantiates cannot pick up the correct resources in an HDFS high-availability setup. Adding spark.hadoop.url support is one clean way to solve this issue. > High-Availability:: SparkHadoopUtil.scala should support > hadoopConfiguration.addResource() > -- > > Key: SPARK-7807 > URL: https://issues.apache.org/jira/browse/SPARK-7807 > Project: Spark > Issue Type: Improvement > Environment: running spark against a remote-hadoop HA cluster. Ease of > use with the spark.hadoop.url. prefix. > 1) user can supply sparkConf entries with the prefix spark.hadoop.url., like > spark.hadoop.url.core-site > and spark.hadoop.url.hdfs-site >Reporter: Norman He >Priority: Trivial > Labels: easyfix > > line 97 : the code below should be able to change from > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } > to the new version--- > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > if( key.startsWith("spark.hadoop.url.")) >hadoopConf.addResource(new URL(value)) > else > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
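For readability, here is the SPARK-7807 proposal restated as a self-contained sketch (the wrapper object and method name are illustrative, not the actual code in SparkHadoopUtil): keys under spark.hadoop.url.* are treated as remote configuration resources and added via addResource, while the remaining spark.hadoop.* keys are copied into the Hadoop configuration as plain key/value settings.
{code}
import java.net.URL

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object HadoopConfFromSparkConf {
  // Sketch of the behaviour proposed in this ticket.
  def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
    conf.getAll.foreach { case (key, value) =>
      if (key.startsWith("spark.hadoop.url.")) {
        // e.g. spark.hadoop.url.core-site=http://config-service/core-site.xml
        hadoopConf.addResource(new URL(value))
      } else if (key.startsWith("spark.hadoop.")) {
        hadoopConf.set(key.substring("spark.hadoop.".length), value)
      }
    }
  }
}
{code}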
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555311#comment-14555311 ] Xiangrui Meng commented on SPARK-7404: -- I think we only need to wrap `RegressionMetrics` from the `spark.mllib` package, which provides R2, RMSE, and MAE. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
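For reference, a minimal sketch of using the existing spark.mllib class that the proposed RegressionEvaluator would wrap, assuming an RDD of (prediction, label) pairs produced by some regression model:
{code}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.rdd.RDD

// predictionsAndLabels: (prediction, observation) pairs from any regression model.
def printRegressionMetrics(predictionsAndLabels: RDD[(Double, Double)]): Unit = {
  val metrics = new RegressionMetrics(predictionsAndLabels)
  println(s"RMSE = ${metrics.rootMeanSquaredError}")
  println(s"MAE  = ${metrics.meanAbsoluteError}")
  println(s"R2   = ${metrics.r2}")
}
{code}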
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555310#comment-14555310 ] Xiangrui Meng commented on SPARK-7785: -- In PySpark, we should delegate all local linear algebra operations to numpy and scipy. For the factory methods, users should use numpy/scipy directly. > Add missing items to pyspark.mllib.linalg.Matrices > -- > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/21/15 11:49 PM: Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7808) Package doc for spark.ml.feature
Xiangrui Meng created SPARK-7808: Summary: Package doc for spark.ml.feature Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/21/15 11:45 PM: Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. 13. Mention `RegexTokenizer` in `Tokenizer`. 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7807) High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource()
[ https://issues.apache.org/jira/browse/SPARK-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7807: - Issue Type: Improvement (was: Bug) Can you explain this change more and why it would be useful? > High-Availability:: SparkHadoopUtil.scala should support > hadoopConfiguration.addResource() > -- > > Key: SPARK-7807 > URL: https://issues.apache.org/jira/browse/SPARK-7807 > Project: Spark > Issue Type: Improvement > Environment: running spark against a remote-hadoop HA cluster. Ease of > use with the spark.hadoop.url. prefix. > 1) user can supply sparkConf entries with the prefix spark.hadoop.url., like > spark.hadoop.url.core-site > and spark.hadoop.url.hdfs-site >Reporter: Norman He >Priority: Trivial > Labels: easyfix > > line 97 : the code below should be able to change from > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } > to the new version--- > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > if( key.startsWith("spark.hadoop.url.")) >hadoopConf.addResource(new URL(value)) > else > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Assignee: Cheng Lian > Failed to start thrift server when metastore is postgre sql > --- > > Key: SPARK-7758 > URL: https://issues.apache.org/jira/browse/SPARK-7758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Tao Wang >Assignee: Cheng Lian >Priority: Blocker > Attachments: hive-site.xml, with error.log, with no error.log > > > I am using today's master branch to start thrift server with setting > metastore to postgre sql, and it shows error like: > {code} > 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE > 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE > DELETEME1432125837197 CASCADE : Syntax error: Encountered "CASCADE" at line > 1, column 34. > java.sql.SQLSyntaxErrorException: Syntax error: Encountered "CASCADE" at line > 1, column 34. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at > org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) > at > org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` > But it works well with earlier master branch (on 7th, April). > After printing their debug level log, I found current branch tries to connect > with derby but didn't know why, maybe the big reconstructure in sql module > cause this issue. 
> The Datastore shows in current branch: > 15/05/20 20:43:57 DEBUG Datastore: === Datastore > = > 15/05/20 20:43:57 DEBUG Datastore: StoreManager : "rdbms" > (org.datanucleus.store.rdbms.RDBMSStoreManager) > 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write > 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), > Validate(None) > 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, > STOREDPROC] > 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 > 15/05/20 20:43:57 DEBUG Datastore: > === > 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : > org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter > 15/05/20 20:43:57 DEBUG Datastore: Datastore : name="Apache Derby" > version="10.10.1.1 - (1458268)" > 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name="Apache Derby > Embedded JDBC Driver" version="10.10.1.1 - (1458268)" > 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : > URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] > 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : > URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] > 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : > factory="datanucleus1" case=UPPERCASE catalog= schema=SPARK > 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : "MixedCase" > UPPERCASE "MixedCase-Sensitive" > 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : > Table=128 Column=128 Constraint=128 Index=128 Delimiter=" > 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : > catalog=false schema=true > 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, > rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL > 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes > (max-batch-size=50) > 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, > type=forward-only, concurrency=read-only > 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 > 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, > DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, > VARBINARY, JAVA_OBJECT > 15/05/20 20:43:57 DEBUG Datastore: > === > The Datastore in earlier master branch: > 15/05/20 20:18:10 D
[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555285#comment-14555285 ] Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:36 PM: scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural first metrics to make available via the Evaluator. was (Author: rams): scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555285#comment-14555285 ] Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:35 PM: scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. was (Author: rams): sickout learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555285#comment-14555285 ] Ram Sriharsha commented on SPARK-7404: -- scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6289) PySpark doesn't maintain SQL date Types
[ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555286#comment-14555286 ] Davies Liu commented on SPARK-6289: --- This will be fixed by upgrading to Pyrolite 4.6, which will pickle java.sql.Date as datetime.date. > PySpark doesn't maintain SQL date Types > --- > > Key: SPARK-6289 > URL: https://issues.apache.org/jira/browse/SPARK-6289 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.1 >Reporter: Michael Nazario >Assignee: Davies Liu > > For the DateType, Spark SQL requires a datetime.date in Python. However, if > you collect a row based on that type, you'll end up with a returned value > which is of type datetime.datetime. > I have tried to reproduce this using the pyspark shell, but have been unable > to. This is definitely a problem coming from pyrolite though: > https://github.com/irmen/Pyrolite/ > Pyrolite is being used for datetime and date serialization, but it appears to > map to datetime objects rather than date objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7624) Task scheduler delay is increasing time over time in spark local mode
[ https://issues.apache.org/jira/browse/SPARK-7624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555284#comment-14555284 ] Apache Spark commented on SPARK-7624: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6337 > Task scheduler delay is increasing time over time in spark local mode > - > > Key: SPARK-7624 > URL: https://issues.apache.org/jira/browse/SPARK-7624 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Jack Hu >Assignee: Davies Liu > Labels: delay, schedule > Fix For: 1.4.0 > > > I am running a simple spark streaming program with spark 1.3.1 in local mode, > it receives json string from a socket with rate 50 events per second, it can > run well in first 6 hours (although the minor gc count per minute is > increasing all the time), after that, i can see that the scheduler delay in > every task is significant increased from 10 ms to 100 ms, after 10 hours > running, the task delay is about 800 ms and cpu is also increased from 2% to > 30%. This causes the steaming job can not finish in one batch interval (5 > seconds). I dumped the java memory after 16 hours and can see there are about > 20 {{org.apache.spark.scheduler.local.ReviveOffers}} objects in > {{akka.actor.LightArrayRevolverScheduler$TaskQueue[]}}. Then i checked the > code and see only one place may put the {{ReviveOffers}} to akka > {{LightArrayRevolverScheduler}}: the {{LocalActor::reviveOffers}} > {code} > def reviveOffers() { > val offers = Seq(new WorkerOffer(localExecutorId, localExecutorHostname, > freeCores)) > val tasks = scheduler.resourceOffers(offers).flatten > for (task <- tasks) { > freeCores -= scheduler.CPUS_PER_TASK > executor.launchTask(executorBackend, taskId = task.taskId, > attemptNumber = task.attemptNumber, > task.name, task.serializedTask) > } > if (tasks.isEmpty && scheduler.activeTaskSets.nonEmpty) { > // Try to reviveOffer after 1 second, because scheduler may wait for > locality timeout > context.system.scheduler.scheduleOnce(1000 millis, self, ReviveOffers) > } > } > {code} > I removed the last three lines in this method (the whole {{if}} block, which > is introduced from https://issues.apache.org/jira/browse/SPARK-4939), it > worked smooth after 20 hours running, the scheduler delay is about 10 ms all > the time. So there should have some conditions that the ReviveOffers will be > duplicate scheduled? I am not sure why this happens, but i feel that this is > the root cause of this issue. > My spark settings: > # Memor: 3G > # CPU: 8 cores > # Streaming Batch interval: 5 seconds. > Here are my streaming code: > {code} > val input = ssc.socketTextStream( > hostname, port, StorageLevel.MEMORY_ONLY_SER).mapPartitions( > /// parse the json to Order > Order(_), preservePartitioning = true) > val mresult = input.map( > v => (v.customer, UserSpending(v.customer, v.count * v.price, > v.timestamp.toLong))).cache() > val tempr = mresult.window( > Seconds(firstStageWindowSize), > Seconds(firstStageWindowSize) > ).transform( > rdd => rdd.union(rdd).union(rdd).union(rdd) > ) > tempr.count.print > tempr.cache().foreachRDD((rdd, t) => { > for (i <- 1 to 5) { > val c = rdd.filter(x=>scala.util.Random.nextInt(5) == i).count() > println("""T: """ + t + """: """ + c) > } > }) > {code} > > Updated at 2015-05-15 > I did print some detail schedule times of the suspect lines in > {{LocalActor::reviveOffers}}: {color:red}*1685343501*{color} times after 18 > hours running. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555282#comment-14555282 ] Apache Spark commented on SPARK-4939: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6337 > Python updateStateByKey example hang in local mode > -- > > Key: SPARK-4939 > URL: https://issues.apache.org/jira/browse/SPARK-4939 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, Streaming >Affects Versions: 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.2, 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555280#comment-14555280 ] Joseph K. Bradley commented on SPARK-7785: -- Ping [~brkyvz] [~mengxr] to weigh in on how many linalg methods we want to add to our PySpark classes > Add missing items to pyspark.mllib.linalg.Matrices > -- > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7807) High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource()
Norman He created SPARK-7807: Summary: High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource() Key: SPARK-7807 URL: https://issues.apache.org/jira/browse/SPARK-7807 Project: Spark Issue Type: Bug Environment: running spark against a remote-hadoop HA cluster. Ease of use with the spark.hadoop.url. prefix. 1) user can supply sparkConf entries with the prefix spark.hadoop.url., like spark.hadoop.url.core-site and spark.hadoop.url.hdfs-site Reporter: Norman He Priority: Trivial line 97 : the code below should be able to change from conf.getAll.foreach { case (key, value) => if (key.startsWith("spark.hadoop.")) { hadoopConf.set(key.substring("spark.hadoop.".length), value) } } to the new version--- conf.getAll.foreach { case (key, value) => if (key.startsWith("spark.hadoop.")) { if( key.startsWith("spark.hadoop.url.")) hadoopConf.addResource(new URL(value)) else hadoopConf.set(key.substring("spark.hadoop.".length), value) } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-7404: Assignee: Ram Sriharsha > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7806) spark-ec2 launch script fails for Python3
[ https://issues.apache.org/jira/browse/SPARK-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7806: --- Assignee: (was: Apache Spark) > spark-ec2 launch script fails for Python3 > - > > Key: SPARK-7806 > URL: https://issues.apache.org/jira/browse/SPARK-7806 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 1.3.1 > Environment: All platforms. >Reporter: Matthew Goodman >Priority: Minor > > Depending on the options used the spark-ec2 script will terminate > ungracefully. > Relevant buglets include: > - urlopen() returning bytes vs. string > - floor division change for partition calculation > - filter() iteration behavior change in module calculation > I have a fixed version that I wish to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org