[jira] [Updated] (SPARK-6174) Improve doc: Python ALS, MatrixFactorizationModel

2016-04-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-6174:
--
Component/s: (was: Documentation)

> Improve doc: Python ALS, MatrixFactorizationModel
> -
>
> Key: SPARK-6174
> URL: https://issues.apache.org/jira/browse/SPARK-6174
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> The Python docs for recommendation have almost no content except an example.  
> Add class, method & attribute descriptions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6174) Improve doc: Python ALS, MatrixFactorizationModel

2016-04-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247288#comment-15247288
 ] 

Nick Pentreath commented on SPARK-6174:
---

[~josephkb] I think SPARK-12632 took care of most of this. If you agree let's 
close this ticket.

> Improve doc: Python ALS, MatrixFactorizationModel
> -
>
> Key: SPARK-6174
> URL: https://issues.apache.org/jira/browse/SPARK-6174
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> The Python docs for recommendation have almost no content except an example.  
> Add class, method & attribute descriptions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-14725:
---

 Summary: Remove HttpServer
 Key: SPARK-14725
 URL: https://issues.apache.org/jira/browse/SPARK-14725
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Saisai Shao


{{HttpServer}}, which used to support broadcast variables and jar/file 
transmission, now seems obsolete. Searching the code shows that no class depends 
on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247293#comment-15247293
 ] 

Saisai Shao commented on SPARK-14725:
-

What's your opinion [~rxin] [~andrewor14]?

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Saisai Shao
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-14398.
---
Resolution: Resolved
  Assignee: Bo Meng

> Audit non-reserved keyword list in ANTLR4 parser.
> -
>
> Key: SPARK-14398
> URL: https://issues.apache.org/jira/browse/SPARK-14398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Bo Meng
>
> We need to check that all keywords that were non-reserved in the `old` ANTLR3 
> parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
> join keywords {{LEFT}}, {{RIGHT}}, and {{FULL}}; these used to be non-reserved 
> and are now reserved.
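
A quick, hypothetical illustration of the user-visible effect, assuming an 
existing {{sqlContext}} and a table {{directions}} with columns named {{left}} 
and {{right}} (both made up for this sketch):

{code}
// Backtick-quoted identifiers always parse, regardless of reserved-ness.
val ok = sqlContext.sql("SELECT `left`, `right` FROM directions")

// With LEFT/RIGHT non-reserved (old ANTLR3 parser) this parses as column names;
// once they become reserved keywords, the unquoted form fails to parse.
val maybeBroken = sqlContext.sql("SELECT left, right FROM directions")
{code}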



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247298#comment-15247298
 ] 

Reynold Xin commented on SPARK-14725:
-

Doesn't the REPL use it?


> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Saisai Shao
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247316#comment-15247316
 ] 

Saisai Shao commented on SPARK-14725:
-

I think it now uses RPC instead of HTTP; please see SPARK-11563. Also, by 
searching the code, only one test case, {{ExecutorClassLoaderSuite}}, uses 
{{HttpServer}}.

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Saisai Shao
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247317#comment-15247317
 ] 

Reynold Xin commented on SPARK-14725:
-

Does getClassFileInputStreamFromHttpServer not use it?

If we are not actually using it, please go ahead. BTW, this shouldn't be in the 
2.0 deprecation ticket since it is not user-facing.


> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Saisai Shao
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247318#comment-15247318
 ] 

Saisai Shao commented on SPARK-14725:
-

OK, let me check first :).

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Saisai Shao
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-14725:

Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-11806)

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-14725:

Priority: Minor  (was: Major)

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12919) Implement dapply() on DataFrame in SparkR

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247319#comment-15247319
 ] 

Apache Spark commented on SPARK-12919:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/12493

> Implement dapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12919
> URL: https://issues.apache.org/jira/browse/SPARK-12919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> dapply() applies an R function on each partition of a DataFrame and returns a 
> new DataFrame.
> The function signature is:
> {code}
>   dapply(df, function(localDF) {}, schema = NULL)
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame
> Schema specifies the Row format of the resulting DataFrame. It must match the 
> R function's output.
> If schema is not specified, each partition of the result DataFrame will be 
> serialized in R into a single byte array. Such resulting DataFrame can be 
> processed by successive calls to dapply() or collect(), but can't be 
> processed by normal DataFrame operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12919) Implement dapply() on DataFrame in SparkR

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12919:


Assignee: Apache Spark

> Implement dapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12919
> URL: https://issues.apache.org/jira/browse/SPARK-12919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> dapply() applies an R function on each partition of a DataFrame and returns a 
> new DataFrame.
> The function signature is:
> {code}
>   dapply(df, function(localDF) {}, schema = NULL)
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame
> Schema specifies the Row format of the resulting DataFrame. It must match the 
> R function's output.
> If schema is not specified, each partition of the result DataFrame will be 
> serialized in R into a single byte array. Such resulting DataFrame can be 
> processed by successive calls to dapply() or collect(), but can't be 
> processed by normal DataFrame operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12919) Implement dapply() on DataFrame in SparkR

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12919:


Assignee: (was: Apache Spark)

> Implement dapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12919
> URL: https://issues.apache.org/jira/browse/SPARK-12919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> dapply() applies an R function on each partition of a DataFrame and returns a 
> new DataFrame.
> The function signature is:
> {code}
>   dapply(df, function(localDF) {}, schema = NULL)
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame
> Schema specifies the Row format of the resulting DataFrame. It must match the 
> R function's output.
> If schema is not specified, each partition of the result DataFrame will be 
> serialized in R into a single byte array. Such resulting DataFrame can be 
> processed by successive calls to dapply() or collect(), but can't be 
> processed by normal DataFrame operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247357#comment-15247357
 ] 

Sean Owen commented on SPARK-14703:
---

To be more specific, I mean leaving log4j in the build so it can link 
correctly, but then routing actual calls to another logging backend. That 
should "work" but Spark logger manipulation will do nothing. (There's an 
outside chance I've missed why just doing that doesn't quite work.)

Why logback, BTW? Doesn't that have the same problem, for any logging 
implementation?

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247358#comment-15247358
 ] 

Saisai Shao commented on SPARK-14725:
-

I just searched the repl code; from my understanding there is currently no 
code using {{HttpServer}}, so it seems safe to remove it. [~vanzin], what's your 
comment? I think you changed this part.

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247358#comment-15247358
 ] 

Saisai Shao edited comment on SPARK-14725 at 4/19/16 8:13 AM:
--

I just searched the repl code; from my understanding there is currently no 
code using {{HttpServer}}, so it seems safe to remove it. [~vanzin], what's your 
comment? I think you changed this part.


was (Author: jerryshao):
I just search the repl code, from my understanding seems currently there's no 
code code using {{HttpServer}}, seems safe to remove it. [~vanzin], what's your 
comment, l think you changes this part.

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose removing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14326) Can't specify "long" type in structField

2016-04-19 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247360#comment-15247360
 ] 

Sun Rui commented on SPARK-14326:
-

We could add support for a "bigint" type in structField. The question is whether 
it is meaningful, since R has no built-in support for 64-bit integers. Could you 
give more details on your use case for the "long" type?
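
For reference, a minimal Scala-side sketch of the schema being asked for; the 
open question above is only which SparkR type string ("long" vs. "bigint") 
should map to this:

{code}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A 64-bit integer column: the Scala/Java equivalent of what the reporter
// is trying to declare from SparkR.
val schema = StructType(Seq(StructField("id", LongType, nullable = true)))
{code}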

> Can't specify "long" type in structField
> 
>
> Key: SPARK-14326
> URL: https://issues.apache.org/jira/browse/SPARK-14326
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Dmitriy Selivanov
>
> tried `long`, `bigint`, `LongType`, `Long`. Nothing works...
> {code}
> schema <- structType(structField("id", "long"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14726) Support for sampling when inferring schema in CSV data source

2016-04-19 Thread Bomi Kim (JIRA)
Bomi Kim created SPARK-14726:


 Summary: Support for sampling when inferring schema in CSV data 
source
 Key: SPARK-14726
 URL: https://issues.apache.org/jira/browse/SPARK-14726
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Bomi Kim


Currently, I am using the CSV data source and trying to get used to Spark 2.0 
because it has a built-in CSV data source.

I realized that the CSV data source infers the schema from all of the data, 
whereas the JSON data source supports a sampling ratio option.

It would be great if the CSV data source had this option too (or is this 
supported already?).
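
For comparison, a rough Scala sketch: the JSON source already accepts a 
{{samplingRatio}} option for schema inference, and the request is a matching 
option for CSV. The CSV option below is hypothetical (its name is assumed, it 
does not exist yet), and the file paths are placeholders:

{code}
// Existing behaviour: JSON schema inference can look at a sample of the input.
val jsonDF = sqlContext.read
  .option("samplingRatio", "0.1")   // infer the schema from roughly 10% of the rows
  .json("events.json")

// Hypothetical CSV counterpart requested in this ticket:
val csvDF = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("samplingRatio", "0.1")   // assumption: option name mirrors the JSON source
  .load("events.csv")
{code}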




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14726) Support for sampling when inferring schema in CSV data source

2016-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247378#comment-15247378
 ] 

Hyukjin Kwon commented on SPARK-14726:
--

This is currently not supported. I can work on this, but I feel a bit hesitant 
because I believe the CSV data source was ported mainly for the "small data" 
world. On the other hand, I believe there are a lot of users dealing with large 
CSV files.
I will work on this if it is decided to be supported. [~rxin]

> Support for sampling when inferring schema in CSV data source
> -
>
> Key: SPARK-14726
> URL: https://issues.apache.org/jira/browse/SPARK-14726
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Bomi Kim
>
> Currently, I am using the CSV data source and trying to get used to Spark 2.0 
> because it has a built-in CSV data source.
> I realized that the CSV data source infers the schema from all of the data, 
> whereas the JSON data source supports a sampling ratio option.
> It would be great if the CSV data source had this option too (or is this 
> supported already?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14723) A new way to support dynamic allocation in Spark Streaming

2016-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247384#comment-15247384
 ] 

Saisai Shao commented on SPARK-14723:
-

It would be better not to set fix version and target version.

> A new way to support dynamic allocation in Spark Streaming
> --
>
> Key: SPARK-14723
> URL: https://issues.apache.org/jira/browse/SPARK-14723
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: WilliamZhu
>  Labels: features
> Fix For: 2.1.0
>
> Attachments: spark-streaming-dynamic-allocation-desigh.pdf
>
>
> Provide a more powerful Algorithm to support dynamic allocation in spark 
> streaming.
> more details: http://www.jianshu.com/p/ae7fdd4746f6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247399#comment-15247399
 ] 

Apache Spark commented on SPARK-13266:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/12494

> Python DataFrameReader converts None to "None" instead of null
> --
>
> Key: SPARK-13266
> URL: https://issues.apache.org/jira/browse/SPARK-13266
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Linux standalone but probably applies to all
>Reporter: mathieu longtin
>  Labels: easyfix, patch
>
> If you do something like this:
> {code:none}
> tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
> tsv_loader.options(quote=None, escape=None)
> {code}
> The loader sees the string "None" as the _quote_ and _escape_ options. The 
> loader should get a _null_.
> An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top, 
> correct the _to_str_ function. Here's the patch:
> {code:none}
> diff --git a/python/pyspark/sql/readwriter.py 
> b/python/pyspark/sql/readwriter.py
> index a3d7eca..ba18d13 100644
> --- a/python/pyspark/sql/readwriter.py
> +++ b/python/pyspark/sql/readwriter.py
> @@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]
>  def to_str(value):
>  """
> -A wrapper over str(), but convert bool values to lower case string
> +A wrapper over str(), but convert bool values to lower case string, and 
> keep None
>  """
>  if isinstance(value, bool):
>  return str(value).lower()
> +elif value is None:
> +return value
>  else:
>  return str(value)
> {code}
> This has been tested and works great.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14723) A new way to support dynamic allocation in Spark Streaming

2016-04-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14723:
--
Target Version/s:   (was: 2.1.0)
  Labels:   (was: features)
   Fix Version/s: (was: 2.1.0)

I'm not sure I agree that the goal of a streaming job is to make the processing 
time exactly fit the batch interval. Finishing sooner is generally the goal, 
and taking nowhere near the batch interval to finish for safety. That is, I 
think the current behavior -- adding executors when there's work to do -- is 
more ideal than hoping to guess the right number of executors based on previous 
runtimes.

> A new way to support dynamic allocation in Spark Streaming
> --
>
> Key: SPARK-14723
> URL: https://issues.apache.org/jira/browse/SPARK-14723
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: WilliamZhu
> Attachments: spark-streaming-dynamic-allocation-desigh.pdf
>
>
> Provide a more powerful Algorithm to support dynamic allocation in spark 
> streaming.
> more details: http://www.jianshu.com/p/ae7fdd4746f6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Darshan Mehta (JIRA)
Darshan Mehta created SPARK-14727:
-

 Summary: NullPointerException while trying to launch local spark 
job
 Key: SPARK-14727
 URL: https://issues.apache.org/jira/browse/SPARK-14727
 Project: Spark
  Issue Type: Bug
Reporter: Darshan Mehta


OS : Windows 10
Spark Version : 1.6.1
Java version : 1.8

I am trying to launch a simple Spark job from eclipse, after starting spark 
master and registering one worker. JavaRDDs are created successfully, however, 
a NPE is thrown while collect() operation is executed. Below are the steps that 
I performed:

1. Downloaded Spark 1.6.1
2. Built it locally with 'sbt package' and 'sbt assembly' commands 
3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
spark://master:7077 -c 2'
5. Verified both Master and Worker are up, and have enough resources in Spark UI
6. Created a maven project in eclipse, with spark dependency
7. Executed attached "SparkCrud.java" in eclipse
8. NPE is thrown, logs are attached "Logs.log"

It seems it's trying to execute Hadoop binaries, however, I am not using Hadoop 
anywhere at all. Also, I tried placing winutil.exe in C:\\ and configured 
"hadoop.home.dir" System property (as suggested in another JIRA), however that 
doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Darshan Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshan Mehta updated SPARK-14727:
--
Attachment: SparkCrud.java

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Darshan Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshan Mehta updated SPARK-14727:
--
Attachment: Logs.log

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2016-04-19 Thread Matthew Byng-Maddick (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247424#comment-15247424
 ] 

Matthew Byng-Maddick commented on SPARK-14703:
--

Leaving log4j in the build and then routing the calls at runtime gives the 
stack trace above when trying to construct the sparkContext (which means you 
never actually get a sparkContext, leaving the shell, at least, unusable). As 
best I can tell, it's because org.apache.log4j.Logger.getLogger() doesn't 
actually give you back an org.apache.log4j.Logger in that situation, but gives 
you back a ch.qos.logback.Logger instead, which means there is no polymorphic 
setLevel() taking an org.apache.log4j.Level as an argument.

I have to admit to being a little unclear myself; only that doing the class 
matching and adding logback, which seems to be popular, does appear to fix the 
problems and allows the sparkContext to be constructed both standalone and under 
YARN (in particular within the spark-shell, but presumably in general).

As to "why logback" - I think I mentioned this, but really two reasons for us:
1) online-updatable logging configs (meaning we can change the logging config 
without restarting services)
2) we have a bunch of logstash/elasticsearch infrastructure, and logback (with 
a connector class) can natively write directly to logstash instead of writing 
locally and then having an agent pick up the data. This allows us to collate 
and correlate our Hadoop and HBase logs across the cluster.

Thanks for engaging, and I hope that even if we don't solve the problem in this 
way, we can at least get to a point where we're not reliant on log4j at runtime 
to be able to use Spark at all (even if not all the logger config features are 
enabled).
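
For what it's worth, a minimal defensive sketch of the kind of guard being 
discussed. This is a hypothetical helper, not what Spark's Logging.scala 
currently does: it only casts down to log4j when the SLF4J binding really is 
log4j, and otherwise skips the level manipulation entirely:

{code}
import org.slf4j.LoggerFactory

// Hypothetical guard: touch log4j levels only when log4j is the actual backend,
// so a logback-backed build never hits the missing setLevel(Level) path.
def trySetLog4jLevel(loggerName: String, level: org.apache.log4j.Level): Unit = {
  val binding = LoggerFactory.getILoggerFactory.getClass.getName
  if (binding.contains("Log4j")) {
    org.apache.log4j.LogManager.getLogger(loggerName).setLevel(level)
  }
  // else: leave level management to the external backend (e.g. the logback config)
}
{code}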

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14600) Push predicates through Expand

2016-04-19 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247427#comment-15247427
 ] 

Wenchen Fan commented on SPARK-14600:
-

working on it

> Push predicates through Expand
> --
>
> Key: SPARK-14600
> URL: https://issues.apache.org/jira/browse/SPARK-14600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> A query with grouping sets is analyzed as Aggregate(Expand(Project)). The 
> grouping attributes come from Project but have a different meaning in Project 
> (equal to the original grouping expression) and in Expand (either the original 
> grouping expression or null). This does not make sense, because the attribute 
> yields a different result in different operators.
> A better way could be Aggregate(Expand()); then we need to fix SQL generation.
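
For context, a small example of a query that produces the 
Aggregate(Expand(Project)) shape described above; the table and column names 
({{employees}}, {{dept}}, {{city}}, {{salary}}) are illustrative only:

{code}
// Grouping sets expand each input row into one row per grouping set,
// which is why predicates pushed below Expand need careful handling.
val df = sqlContext.sql("""
  SELECT dept, city, SUM(salary) AS total
  FROM employees
  GROUP BY dept, city
  GROUPING SETS ((dept), (city))
""")
df.explain(true)   // shows an Aggregate over Expand in the analyzed plan
{code}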



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247431#comment-15247431
 ] 

Sean Owen commented on SPARK-13944:
---

I take the point about some models not being representable in PMML, though I 
think that argues to use PMML where it can be used, and only make up something 
where that's impossible. It's full of "Extension" points to stick in arbitrary 
additional data, so I think it can end up carrying whatever variations you want 
from model building to model scoring. (Of course, custom extensions would only 
make sense to your custom scoring code, but so would any custom format you 
imagine.)

The problem is you're moving from a world you describe as "can't use MLlib for 
scoring" to a world I'd describe as "must use MLlib for scoring" -- if you mean 
you are making up a new format for models that nothing else will read.

I'm still not sure why this implies you need to build yet another vector and 
matrix library. Example: why not use Breeze internally, as now? (or whatever 
other one you want?) No non-Spark code is already going to have data in this 
bespoke vector/matrix format, so there's a translation step no matter what.
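
As a point of reference for the "why not Breeze internally" question, a minimal 
sketch of plain Breeze usage for the kind of local operation being discussed; 
the values are illustrative only:

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

// Local linear algebra with Breeze directly, no Spark dependency involved.
val w = DenseVector(0.5, -1.0, 2.0)
val x = DenseMatrix((1.0, 2.0, 3.0),
                    (4.0, 5.0, 6.0))
val scores = x * w   // DenseVector(4.5, 9.0)
{code}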

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from mllib vector by Spark SQL, the vector will 
> automatically converted into the one in ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247434#comment-15247434
 ] 

Sean Owen commented on SPARK-14703:
---

Oh, logback tries to reimplement some log4j API methods? It sounds like that 
isn't entirely binary compatible. My guess is that this ends up taking 
precedence over log4j, which is ideally where those calls still route to so 
that they do nothing.

Are you saying you want Spark to depend directly on logback to control log 
levels? Let's say logback has some other equivalent methods and you changed to 
use those in Spark. We get the same problem in the end, but just with logback? 
and then nobody's existing log4j config necessarily works anymore with Spark.

You can update log4j programmatically too (right?) but it does mean calling 
directly into it. 

Your goal is merely to use logback though for your own calls, perhaps. Normally 
if you can plumb SLF4J into logger X you can do that, but are you saying 
logback ends up colliding with log4j no matter which way you put this together?

I tried updating Spark to log4j 2.x and faced a bunch of problems, the most 
serious of which was: every single time a transitive dependency brings in log4j 
1.x classes again, it breaks until it's excluded again. But: if log4j 2 would 
somehow also work for you (being also a log4j 1 successor?) and you want to 
take a run at that again, I can look at that with you.

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14723) A new way to support dynamic allocation in Spark Streaming

2016-04-19 Thread WilliamZhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247435#comment-15247435
 ] 

WilliamZhu commented on SPARK-14723:


I think it would be better to give extra executors the first time, then reduce 
the number gradually. But the reduction should not be too slow, because sometimes 
we have a massive number of executors.

The current behavior -- adding executors when there's work to do -- is not ideal 
for streaming applications, especially those with a short batch duration, because 
adding executors is a slow operation. You should also consider situations where 
the cluster is temporarily starved of resources.

We request spark.streaming.dynamicAllocation.maxExecutors executors from YARN 
immediately, in a greedy way, as soon as any delay happens, since we hope to 
eliminate the delay as soon as possible. Then we can release the redundant 
executors.
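
A rough sketch of the policy described above, with made-up names (this is not 
Spark's ExecutorAllocationManager API, just an illustration of the proposal): 
jump straight to the configured maximum whenever a batch runs over its interval, 
and otherwise release executors gradually:

{code}
// Illustrative only: greedy ramp-up on delay, gradual ramp-down when batches finish early.
final case class AllocState(current: Int, min: Int, max: Int)

def nextExecutorTarget(state: AllocState,
                       lastBatchMs: Long,
                       batchIntervalMs: Long,
                       releaseStep: Int = 2): Int = {
  if (lastBatchMs > batchIntervalMs) {
    state.max                                         // any delay: request the maximum at once
  } else {
    math.max(state.min, state.current - releaseStep)  // early finish: shrink slowly
  }
}
{code}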

> A new way to support dynamic allocation in Spark Streaming
> --
>
> Key: SPARK-14723
> URL: https://issues.apache.org/jira/browse/SPARK-14723
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: WilliamZhu
> Attachments: spark-streaming-dynamic-allocation-desigh.pdf
>
>
> Provide a more powerful Algorithm to support dynamic allocation in spark 
> streaming.
> more details: http://www.jianshu.com/p/ae7fdd4746f6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14727.
---
Resolution: Duplicate

No, it's trying to execute local winutils binaries. I'm all but certain this is 
a duplicate of such issues.
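
For reference, the usual workaround on Windows (paths below are examples): 
winutils.exe has to sit under %HADOOP_HOME%\bin, and hadoop.home.dir must point 
at that parent directory, not at the exe or at C:\ directly:

{code}
// Example setup, assuming winutils.exe was copied to C:\hadoop\bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val conf = new org.apache.spark.SparkConf()
  .setAppName("SparkCrud")
  .setMaster("spark://master:7077")
val sc = new org.apache.spark.SparkContext(conf)
{code}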

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247448#comment-15247448
 ] 

Hyukjin Kwon edited comment on SPARK-14525 at 4/19/16 9:52 AM:
---

Shouldn't we then deprecate the support for {{read.format("jdbc")}} if 
{{Properties}} is not guaranteed to be converted to {{String}}?


was (Author: hyukjin.kwon):
Shouldn't we then deprecate the support for {{read.format("jdbc")} if 
{{Properties}} is not guaranteed to be converted to {{String}}?

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -
>
> Key: SPARK-14525
> URL: https://issues.apache.org/jira/browse/SPARK-14525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Justin Pihony
>Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an 
> error  
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
> allow create table as select
> save is a more intuitive guess on the appropriate method to call, so the user 
> should not be punished for not knowing about the jdbc method. 
> Obviously, this will require the caller to have set up the correct parameters 
> for jdbc to work :)
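
To make the two call paths concrete, a short sketch; the connection details are 
placeholders and {{df}} is assumed to be an existing DataFrame:

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")
props.setProperty("password", "secret")

// Works today: the dedicated jdbc() writer method.
df.write.jdbc("jdbc:postgresql://localhost/testdb", "people", props)

// Fails in 1.6 with "does not allow create table as select"; the ticket asks
// for this generic form to delegate to jdbc() instead.
df.write.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "people")
  .option("user", "test")
  .option("password", "secret")
  .save()
{code}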



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247448#comment-15247448
 ] 

Hyukjin Kwon commented on SPARK-14525:
--

Shouldn't we then deprecate the support for {{read.format("jdbc")} if 
{{Properties}} is not guaranteed to be converted to {{String}}?

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -
>
> Key: SPARK-14525
> URL: https://issues.apache.org/jira/browse/SPARK-14525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Justin Pihony
>Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an 
> error  
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
> allow create table as select
> save is a more intuitive guess on the appropriate method to call, so the user 
> should not be punished for not knowing about the jdbc method. 
> Obviously, this will require the caller to have set up the correct parameters 
> for jdbc to work :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14728) Add a rule to block the use of getOrElse(null) which can simply be orNull.

2016-04-19 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-14728:


 Summary: Add a rule to block the use of getOrElse(null) which can 
simply be orNull.
 Key: SPARK-14728
 URL: https://issues.apache.org/jira/browse/SPARK-14728
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Trivial


Currently, it looks like {{getOrElse(null)}} is used in places where it could 
simply be {{orNull}}.

As {{orNull}} is the same expression but shorter and clearer, I think a rule for 
this might have to be added.
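
A small example of the pattern in question (illustrative only):

{code}
// Both produce null when the Option is empty; orNull is the shorter, clearer form.
val name: Option[String] = None

val viaGetOrElse: String = name.getOrElse(null)  // pattern the proposed rule would flag
val viaOrNull: String    = name.orNull           // preferred spelling
{code}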



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14728) Add a rule to block the use of getOrElse(null) which can simply be orNull.

2016-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247451#comment-15247451
 ] 

Hyukjin Kwon commented on SPARK-14728:
--

[~rxin] Do you think it is okay to add a rule? If you are not sure, I can just 
add a PR to fix them without adding a rule.

> Add a rule to block the use of getOrElse(null) which can simply be orNull.
> --
>
> Key: SPARK-14728
> URL: https://issues.apache.org/jira/browse/SPARK-14728
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, it looks like {{getOrElse(null)}} is used in places where it could 
> simply be {{orNull}}.
> As {{orNull}} is the same expression but shorter and clearer, I think a rule 
> for this might have to be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14728) Add a rule to block the use of getOrElse(null) which can simply be orNull.

2016-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247459#comment-15247459
 ] 

Hyukjin Kwon commented on SPARK-14728:
--

Oh, sorry, I noticed that not all classes having {{getOrElse}} have {{orNull}}. 
Please allow me to close this and just open a minor PR. 

> Add a rule to block the use of getOrElse(null) which can simply be orNull.
> --
>
> Key: SPARK-14728
> URL: https://issues.apache.org/jira/browse/SPARK-14728
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, it looks like {{getOrElse(null)}} is used in places where it could 
> simply be {{orNull}}.
> As {{orNull}} is the same expression but shorter and clearer, I think a rule 
> for this might have to be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14723) A new way to support dynamic allocation in Spark Streaming

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247470#comment-15247470
 ] 

Sean Owen commented on SPARK-14723:
---

Generally, your solution resembles the "no dynamic allocation" solution. 
Streaming jobs hang on to resources even when they're idle to avoid having to 
re-allocate them at the beginning of the next batch. You're trying to slowly 
release executors though if the batches are finishing early, until they're 
actually running past the batch interval. I think that's problematic, but, you 
could instead target finishing in x% of the batch interval, not 100%.

Holding onto resources when the cluster is busy just means you're starving some 
other job.

For long batch intervals, where the executor startup time is relatively small, 
then current dynamic allocation is even better since it's better still to 
release the resources and re-acquire them for each infrequent batch. 

For short batch intervals, I see the issue, although to some degree current 
dynamic allocation will also ramp down the executors, even if the batch 
interval is shorter than the release intervals. With very little work, some 
executors will be regularly idle and be shutdown already. It will still 
generally end up over-provisioned, but that's roughly a good thing for 
streaming, since running over batch intervals is bad and input is bursty.

I see your arguments though so far I am not sure how much this adds over the 
behavior of existing dynamic allocation, and it's actually worse in long batch 
interval cases.

> A new way to support dynamic allocation in Spark Streaming
> --
>
> Key: SPARK-14723
> URL: https://issues.apache.org/jira/browse/SPARK-14723
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: WilliamZhu
> Attachments: spark-streaming-dynamic-allocation-desigh.pdf
>
>
> Provide a more powerful Algorithm to support dynamic allocation in spark 
> streaming.
> more details: http://www.jianshu.com/p/ae7fdd4746f6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14728) Add a rule to block the use of getOrElse(null) which can simply be orNull.

2016-04-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-14728.

Resolution: Invalid

> Add a rule to block the use of getOrElse(null) which can simply be orNull.
> --
>
> Key: SPARK-14728
> URL: https://issues.apache.org/jira/browse/SPARK-14728
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, it looks like {{getOrElse(null)}} is used in places where it could 
> simply be {{orNull}}.
> As {{orNull}} is the same expression but shorter and clearer, I think a rule 
> for this might have to be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247498#comment-15247498
 ] 

Kazuaki Ishizaki commented on SPARK-13904:
--

I agree with you, since SPARK-14689 addresses this.

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>Assignee: Hemant Bhanawat
> Fix For: 2.0.0
>
>
> Currently Spark allows only a few cluster managers viz Yarn, Mesos and 
> Standalone. But, as Spark is now being used in newer and different use cases, 
> there is a need for allowing other cluster managers to manage spark 
> components. One such use case is - embedding spark components like executor 
> and driver inside another process which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such a 
> use case is that the executors/driver should not take the parent process down 
> when they go down and the components can be relaunched inside the same 
> process again. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up the tasks without taking the parent 
> process down. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14660) Executors show up active tasks indefinitely after stage is killed

2016-04-19 Thread saurabh paliwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247513#comment-15247513
 ] 

saurabh paliwal commented on SPARK-14660:
-

After some debugging, we found an issue in the handleTaskCompletion method of 
the DAGScheduler class.
Previously, it was:

{code}
if (!stageIdToStage.contains(task.stageId)) {
  // Skip all the actions if the stage has been cancelled.
}
{code}

We changed it to:

{code}
if (!stageIdToStage.contains(task.stageId)) {
  // Skip all the actions if the stage has been cancelled,
  // but still post a task-end event for successfully finished tasks.
  if (event.reason == Success) {
    val attemptId = task.stageAttemptId
    listenerBus.post(SparkListenerTaskEnd(stageId, attemptId, taskType, event.reason,
      event.taskInfo, event.taskMetrics))
  }
  return
}
{code}

This apparently fixed the issue.
The problem was that if a task finishes after cancelTasks is called, the 
stageIdToStage map no longer contains the entry, but we still need to send the 
taskEnd event so that listeners (including the UI and dynamic allocation) see 
the task as finished.
Please review/comment if this is incorrect.

> Executors show up active tasks indefinitely after stage is killed
> -
>
> Key: SPARK-14660
> URL: https://issues.apache.org/jira/browse/SPARK-14660
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.1
> Environment: YARN
>Reporter: saurabh paliwal
>Priority: Minor
>
> If a job is running and we kill it from all stages UI page, the executors on 
> which the tasks were running for that job keep showing them as active tasks, 
> and the executor won't get lost even after executorIdleTimeout seconds if you 
> have dynamic allocated executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Darshan Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247523#comment-15247523
 ] 

Darshan Mehta commented on SPARK-14727:
---

The stacktrace in SPARK-2356 says the following:
"Could not locate executable null\bin\winutils.exe"

However, there is no such stacktrace when I run the app locally. Also, I have 
tried configuring HADOOP_HOME and HADOOP_CONF_DIR, but neither of those options 
worked for me. So I believe this is a different issue.


> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Darshan Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshan Mehta reopened SPARK-14727:
---

Not a duplicate

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247533#comment-15247533
 ] 

Sean Owen commented on SPARK-14727:
---

It's almost certainly the same class of problem. I don't know that HADOOP_HOME 
is relevant.
However, have a look at 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html
You may not even have the binaries that are required. Please follow up on that 
before reopening.
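
For reference, the usual workaround on Windows looks roughly like the sketch 
below. The path is illustrative; it assumes a winutils.exe matching your Hadoop 
build has been placed under {{C:\hadoop\bin}}, and it reuses the reporter's 
master URL and app name as placeholders.

{code}
// Tell Hadoop's shell utilities where winutils.exe lives
// *before* the SparkContext is created (expects C:\hadoop\bin\winutils.exe).
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val conf = new org.apache.spark.SparkConf()
  .setAppName("SparkCrud")
  .setMaster("spark://master:7077")
val sc = new org.apache.spark.SparkContext(conf)
{code}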

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247543#comment-15247543
 ] 

Hemant Bhanawat commented on SPARK-14729:
-

I am looking into this. 

> Implement an existing cluster manager with New ExternalClusterManager 
> interface
> ---
>
> Key: SPARK-14729
> URL: https://issues.apache.org/jira/browse/SPARK-14729
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> SPARK-13904 adds an ExternalClusterManager interface to Spark to allow 
> external cluster managers to spawn Spark components. 
> This JIRA tracks following suggestion from [~rxin]: 
> 'One thing - can you guys try to see if you can implement one of the existing 
> cluster managers with this, and then we can make sure this is a proper API? 
> Otherwise it is really easy to get removed because it is currently unused by 
> anything in Spark.' 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-14729:
---

 Summary: Implement an existing cluster manager with New 
ExternalClusterManager interface
 Key: SPARK-14729
 URL: https://issues.apache.org/jira/browse/SPARK-14729
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Hemant Bhanawat
Priority: Minor


SPARK-13904 adds an ExternalClusterManager interface to Spark to allow external 
cluster managers to spawn Spark components. 

This JIRA tracks following suggestion from [~rxin]: 

'One thing - can you guys try to see if you can implement one of the existing 
cluster managers with this, and then we can make sure this is a proper API? 
Otherwise it is really easy to get removed because it is currently unused by 
anything in Spark.' 
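
For context, the interface added by SPARK-13904 looks roughly like the 
following simplified sketch (the types involved are internal, private[spark] 
ones; see the actual trait in org.apache.spark.scheduler for the authoritative 
signatures):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SchedulerBackend, TaskScheduler}

// Simplified sketch of the ExternalClusterManager trait from SPARK-13904
private[spark] trait ExternalClusterManager {
  // Can this cluster manager handle the given master URL?
  def canCreate(masterURL: String): Boolean

  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler

  def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend

  // Wire the scheduler and backend together after creation.
  def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit
}
{code}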





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Darshan Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshan Mehta closed SPARK-14727.
-
Resolution: Fixed

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-19 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247576#comment-15247576
 ] 

Takeshi Yamamuro commented on SPARK-14525:
--

Yeah, +1. I'm not sure we need special handling for the `jdbc` source. I think 
it would be better to move all the jdbc functionality of `DataFrameWriter` into 
`jdbc.DefaultSource`, so that it mirrors the `read.format("jdbc")` path.
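
To make the gap concrete, here is a small sketch, assuming an existing 
DataFrame {{df}}; the URL, table name and credentials are placeholders. It 
contrasts the path that works today with the one this issue wants to support:

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")        // placeholder credentials
props.setProperty("password", "secret")

// Works today: the dedicated jdbc() method on DataFrameWriter
df.write.jdbc("jdbc:postgresql://localhost/testdb", "people", props)

// Currently fails with "... does not allow create table as select":
df.write.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "people")
  .option("user", "test")
  .option("password", "secret")
  .save()
{code}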

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -
>
> Key: SPARK-14525
> URL: https://issues.apache.org/jira/browse/SPARK-14525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Justin Pihony
>Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an 
> error  
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
> allow create table as select
> save is a more intuitive guess on the appropriate method to call, so the user 
> should not be punished for not knowing about the jdbc method. 
> Obviously, this will require the caller to have set up the correct parameters 
> for jdbc to work :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2016-04-19 Thread Ceki Gulcu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247592#comment-15247592
 ] 

Ceki Gulcu commented on SPARK-14703:


@srowen Being able to configure loggers has been an oft-requested feature for 
SLF4J. Can you briefly describe the *essential* configuration primitives you 
would like SLF4J to support? 

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247599#comment-15247599
 ] 

Sean Owen commented on SPARK-14703:
---

Oh hey [~ceki]. In Spark, it's almost entirely about setting log levels on 
loggers. I can't think of anything else it needs to do with loggers, like set a 
log format or anything. (Or at least, that is not essential.)
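
Concretely, the log4j-specific calls that Spark would need an equivalent for 
are of this shape (a minimal sketch):

{code}
import org.apache.log4j.{Level, Logger}

// The kind of thing Spark does today, directly against log4j:
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getRootLogger.setLevel(Level.INFO)
{code}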

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14727) NullPointerException while trying to launch local spark job

2016-04-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247600#comment-15247600
 ] 

Sean Owen commented on SPARK-14727:
---

(Was that the problem, since you re-resolved it?)

> NullPointerException while trying to launch local spark job
> ---
>
> Key: SPARK-14727
> URL: https://issues.apache.org/jira/browse/SPARK-14727
> Project: Spark
>  Issue Type: Bug
>Reporter: Darshan Mehta
> Attachments: Logs.log, SparkCrud.java
>
>
> OS : Windows 10
> Spark Version : 1.6.1
> Java version : 1.8
> I am trying to launch a simple Spark job from eclipse, after starting spark 
> master and registering one worker. JavaRDDs are created successfully, 
> however, a NPE is thrown while collect() operation is executed. Below are the 
> steps that I performed:
> 1. Downloaded Spark 1.6.1
> 2. Built it locally with 'sbt package' and 'sbt assembly' commands 
> 3. Started Master with 'spark-class org.apache.spark.deploy.master.Master'
> 4. Started Worker with 'spark-class org.apache.spark.deploy.worker.Worker 
> spark://master:7077 -c 2'
> 5. Verified both Master and Worker are up, and have enough resources in Spark 
> UI
> 6. Created a maven project in eclipse, with spark dependency
> 7. Executed attached "SparkCrud.java" in eclipse
> 8. NPE is thrown, logs are attached "Logs.log"
> It seems it's trying to execute Hadoop binaries, however, I am not using 
> Hadoop anywhere at all. Also, I tried placing winutil.exe in C:\\ and 
> configured "hadoop.home.dir" System property (as suggested in another JIRA), 
> however that doesn't seem to have done the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14326) Can't specify "long" type in structField

2016-04-19 Thread Dmitriy Selivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247602#comment-15247602
 ] 

Dmitriy Selivanov commented on SPARK-14326:
---

This particular case was related to reading CSV files with the spark-csv 
package. Of course spark-csv can infer the schema, but this requires 2 scans 
over the data.

When I need int64 in R, I use the bit64 package.

> Can't specify "long" type in structField
> 
>
> Key: SPARK-14326
> URL: https://issues.apache.org/jira/browse/SPARK-14326
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Dmitriy Selivanov
>
> tried `long`, `bigint`, `LongType`, `Long`. Nothing works...
> {code}
> schema <- structType(structField("id", "long"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14730) Expose ColumnPruner as feature transformer

2016-04-19 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-14730:
---

 Summary: Expose ColumnPruner as feature transformer
 Key: SPARK-14730
 URL: https://issues.apache.org/jira/browse/SPARK-14730
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.0.0
Reporter: Jacek Laskowski
Priority: Minor


From d...@spark.apache.org:

{quote}
Jacek:
Came across `private class ColumnPruner` with "TODO(ekl) make this a
public transformer" in scaladoc, cf.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L317.

Why is this private and is there a JIRA for the TODO(ekl)?
{quote}

{quote}
Yanbo Liang:
This is because ColumnPruner is currently only used by RFormula; we did not 
expose it as a feature transformer.
Please feel free to create a JIRA and work on it.
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2016-04-19 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247672#comment-15247672
 ] 

Kai Jiang commented on SPARK-928:
-

Since it is labeled as a starter task, I would like to give this one a try.

> Add support for Unsafe-based serializer in Kryo 2.22
> 
>
> Key: SPARK-928
> URL: https://issues.apache.org/jira/browse/SPARK-928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This can reportedly be quite a bit faster, but it also requires Chill to 
> update its Kryo dependency. Once that happens we should add a 
> spark.kryo.useUnsafe flag.
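
If this goes ahead, usage would presumably look something like the following; 
note that {{spark.kryo.useUnsafe}} is the flag proposed above and does not 
exist yet:

{code}
import org.apache.spark.SparkConf

// Hypothetical configuration once the proposed flag lands
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.useUnsafe", "true")   // proposed in this issue, not yet available
{code}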



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-19 Thread Abou Haydar Elias (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247710#comment-15247710
 ] 

Abou Haydar Elias commented on SPARK-14489:
---

I totally agree with [~sethah]. I stumbled on this today. It seems that when 
splitting randomly, we can end up with a test set containing users that are not 
present in the training set, and the same can happen for items. ALS therefore 
cannot produce predictions for such users and returns NaN instead; we are 
running into the classic new-user/new-item (cold start) problem. So what can we 
do here?
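
Until there is a proper fix, one workaround is to drop the NaN predictions 
before evaluating. A minimal sketch, assuming an already-fitted {{alsModel}}, a 
validation DataFrame {{validationDataset}} and a {{rating}} label column (all 
placeholder names):

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.functions.col

val predictions = alsModel.transform(validationDataset)

// Keep only rows where ALS could actually produce a prediction
// (i.e. drop unseen users/items that yield NaN).
val cleaned = predictions.filter(!col("prediction").isNaN)

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(cleaned)
{code}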

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to be the same pbm
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14600) Push predicates through Expand

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14600:


Assignee: (was: Apache Spark)

> Push predicates through Expand
> --
>
> Key: SPARK-14600
> URL: https://issues.apache.org/jira/browse/SPARK-14600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> A grouping sets will be analyzed as Aggregate(Expand(Project)), the grouping 
> attributes came from Project, but have different meaning in Project (equal to 
> original grouping expression) and Expand (could be original grouping 
> expression or null), this does not make sense, because the attribute has 
> different result in different operator,
>  A better way could be  Aggregate(Expand()), then we need to  fix SQL 
> generation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14600) Push predicates through Expand

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14600:


Assignee: Apache Spark

> Push predicates through Expand
> --
>
> Key: SPARK-14600
> URL: https://issues.apache.org/jira/browse/SPARK-14600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> A grouping sets will be analyzed as Aggregate(Expand(Project)), the grouping 
> attributes came from Project, but have different meaning in Project (equal to 
> original grouping expression) and Expand (could be original grouping 
> expression or null), this does not make sense, because the attribute has 
> different result in different operator,
>  A better way could be  Aggregate(Expand()), then we need to  fix SQL 
> generation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14600) Push predicates through Expand

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247750#comment-15247750
 ] 

Apache Spark commented on SPARK-14600:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12496

> Push predicates through Expand
> --
>
> Key: SPARK-14600
> URL: https://issues.apache.org/jira/browse/SPARK-14600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> A grouping sets will be analyzed as Aggregate(Expand(Project)), the grouping 
> attributes came from Project, but have different meaning in Project (equal to 
> original grouping expression) and Expand (could be original grouping 
> expression or null), this does not make sense, because the attribute has 
> different result in different operator,
>  A better way could be  Aggregate(Expand()), then we need to  fix SQL 
> generation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14577) spark.sql.codegen.maxCaseBranches config option

2016-04-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-14577.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12353
[https://github.com/apache/spark/pull/12353]

> spark.sql.codegen.maxCaseBranches config option
> ---
>
> Key: SPARK-14577
> URL: https://issues.apache.org/jira/browse/SPARK-14577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.0
>
>
> We currently disable codegen for CaseWhen if the number of branches is 
> greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better 
> if this value is a non-public config defined in SQLConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14577) spark.sql.codegen.maxCaseBranches config option

2016-04-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-14577:

Assignee: Dongjoon Hyun

> spark.sql.codegen.maxCaseBranches config option
> ---
>
> Key: SPARK-14577
> URL: https://issues.apache.org/jira/browse/SPARK-14577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> We currently disable codegen for CaseWhen if the number of branches is 
> greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better 
> if this value is a non-public config defined in SQLConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10574:


Assignee: Apache Spark  (was: Yanbo Liang)

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Apache Spark
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.
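
For illustration, the two hash functions being compared (a minimal sketch):

{code}
import scala.util.hashing.MurmurHash3

val term = "the quick brown fox"
val nativeHash = term.##                        // Scala native hashing, as HashingTF uses today
val murmurHash = MurmurHash3.stringHash(term)   // MurmurHash3 alternative proposed here
{code}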



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10574:


Assignee: Yanbo Liang  (was: Apache Spark)

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247794#comment-15247794
 ] 

Apache Spark commented on SPARK-10574:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12498

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14034) Converting to Dataset causes wrong order and values in nested array of documents

2016-04-19 Thread Barry Jones (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247988#comment-15247988
 ] 

Barry Jones commented on SPARK-14034:
-

I have the same issue. Nested data values are associated with the wrong keys in 
the case class. From what I can tell the incorrect association seems consistent 
(I haven't noticed it shuffling arbitrarily upon reloads). I have tried 
creating a new file promoting the nested data to top-level data, and have 
confirmed it is correctly loaded when it isn't nested. So there appears to be a 
bug loading nested JSON as a Dataset.
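
If the root cause is positional rather than by-name binding of the nested 
struct fields, a possible (untested) workaround sketch is to declare the nested 
case class fields in the same alphabetical order that JSON schema inference 
produces:

{code}
// Untested workaround sketch: field order matches the inferred schema (a, b, c)
case class Y2(a: Int, b: Int, c: Int)
case class X2(arr: Seq[Y2])

sqlContext.read.json("../test.json").as[X2].collect().head.arr.head.c  // expected: 1
{code}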

> Converting to Dataset causes wrong order and values in nested array of 
> documents
> 
>
> Key: SPARK-14034
> URL: https://issues.apache.org/jira/browse/SPARK-14034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Steven She
>
> I'm deserializing the following JSON document into a Dataset with Spark 1.6.1 
> in the console:
> {noformat}
> {"arr": [{"c": 1, "b": 2, "a": 3}]}
> {noformat}
> I have the following case classes:
> {noformat}
> case class X(arr: Seq[Y])
> case class Y(c: Int, b: Int, a: Int)
> {noformat}
> I run the following in the console to retrieve the value of `c` in the array, 
> which should have a value of 1 in the data file, but I get the value 3 
> instead:
> {noformat}
> scala> sqlContext.read.json("../test.json").as[X].collect().head.arr.head.c
> res19: Int = 3
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14034) Converting to Dataset causes wrong order and values in nested array of documents

2016-04-19 Thread Barry Jones (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247988#comment-15247988
 ] 

Barry Jones edited comment on SPARK-14034 at 4/19/16 3:42 PM:
--

I have the same issue. Nested data values are associated with the wrong keys in 
the case class. From what I can tell the incorrect association seems consistent 
(I haven't noticed it shuffling arbitrarily upon reloads). I have tried 
creating a new file promoting the nested data to top-level data, and have 
confirmed it is correctly loaded when it isn't nested. So there appears to be a 
bug loading JSON containing arrays of nested data as a Dataset.


was (Author: barryjones):
I have the same issue. Nested data values are associated with the wrong keys in 
the case class. From what I can tell the incorrect association seems consistent 
(I haven't noticed it shuffling arbitrarily upon reloads). I have tried 
creating a new file promoting the nested data to top-level data, and have 
confirmed it is correctly loaded when it isn't nested. So there appears to be a 
bug loading nested JSON as a Dataset.

> Converting to Dataset causes wrong order and values in nested array of 
> documents
> 
>
> Key: SPARK-14034
> URL: https://issues.apache.org/jira/browse/SPARK-14034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Steven She
>
> I'm deserializing the following JSON document into a Dataset with Spark 1.6.1 
> in the console:
> {noformat}
> {"arr": [{"c": 1, "b": 2, "a": 3}]}
> {noformat}
> I have the following case classes:
> {noformat}
> case class X(arr: Seq[Y])
> case class Y(c: Int, b: Int, a: Int)
> {noformat}
> I run the following in the console to retrieve the value of `c` in the array, 
> which should have a value of 1 in the data file, but I get the value 3 
> instead:
> {noformat}
> scala> sqlContext.read.json("../test.json").as[X].collect().head.arr.head.c
> res19: Int = 3
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13962) spark.ml Evaluators should support other numeric types for label

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13962:


Assignee: Benjamin Fradet  (was: Apache Spark)

> spark.ml Evaluators should support other numeric types for label
> 
>
> Key: SPARK-13962
> URL: https://issues.apache.org/jira/browse/SPARK-13962
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Benjamin Fradet
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13962) spark.ml Evaluators should support other numeric types for label

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13962:


Assignee: Apache Spark  (was: Benjamin Fradet)

> spark.ml Evaluators should support other numeric types for label
> 
>
> Key: SPARK-13962
> URL: https://issues.apache.org/jira/browse/SPARK-13962
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13962) spark.ml Evaluators should support other numeric types for label

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248120#comment-15248120
 ] 

Apache Spark commented on SPARK-13962:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/12500

> spark.ml Evaluators should support other numeric types for label
> 
>
> Key: SPARK-13962
> URL: https://issues.apache.org/jira/browse/SPARK-13962
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Benjamin Fradet
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13681) Reimplement CommitFailureTestRelationSuite

2016-04-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13681.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12179
[https://github.com/apache/spark/pull/12179]

> Reimplement CommitFailureTestRelationSuite
> --
>
> Key: SPARK-13681
> URL: https://issues.apache.org/jira/browse/SPARK-13681
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
> Fix For: 2.0.0
>
>
> This test case got broken by 
> [#11509|https://github.com/apache/spark/pull/11509].  We should reimplement 
> it as a format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13681) Reimplement CommitFailureTestRelationSuite

2016-04-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13681:
-
Assignee: Cheng Lian

> Reimplement CommitFailureTestRelationSuite
> --
>
> Key: SPARK-13681
> URL: https://issues.apache.org/jira/browse/SPARK-13681
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 2.0.0
>
>
> This test case got broken by 
> [#11509|https://github.com/apache/spark/pull/11509].  We should reimplement 
> it as a format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-19 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248167#comment-15248167
 ] 

Marcelo Vanzin commented on SPARK-14725:


I don't remember if there's an option to use the HTTP transport for the repl, 
but there may not be. It's definitely not the default anymore. I'm fine with 
nuking it.

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}} used to support broadcast variables and jars/files 
> transmission now seems obsolete, by searching the codes, actually no one 
> class depends on it except one unit test, so here propose to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14731) Revert SPARK-12130 to make 2.0 shuffle service compatible with 1.x

2016-04-19 Thread Mark Grover (JIRA)
Mark Grover created SPARK-14731:
---

 Summary: Revert SPARK-12130 to make 2.0 shuffle service compatible 
with 1.x
 Key: SPARK-14731
 URL: https://issues.apache.org/jira/browse/SPARK-14731
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.0.0
Reporter: Mark Grover


Discussion on the dev list on [this 
thread|http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html].

Conclusion seems to be that we should try to maintain compatibility between 
Spark 1.x and Spark 2.x's shuffle service so folks who may want to run Spark 1 
and Spark 2 on, say, the same YARN cluster can do that easily while running 
only one shuffle service.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14491) refactor object operator framework to make it easy to eliminate serializations

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14491.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12260
[https://github.com/apache/spark/pull/12260]

> refactor object operator framework to make it easy to eliminate serializations
> --
>
> Key: SPARK-14491
> URL: https://issues.apache.org/jira/browse/SPARK-14491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM

2016-04-19 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248186#comment-15248186
 ] 

yuhao yang commented on SPARK-14709:


I put the prototype in 
https://github.com/hhbyyh/spark/blob/mlsvm/mllib/src/main/scala/org/apache/spark/ml/classification/SVM.scala.
 It's just a simple version with OWL-QN and the hinge gradient.

I plan to implement another version with parallel SMO before sending a pull 
request. 
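
For reference, the hinge loss and subgradient such a prototype is built around 
(a standalone sketch, not the code in the linked branch):

{code}
// Hinge loss for a labeled point (label y in {-1, +1}):
//   loss(w; x, y) = max(0, 1 - y * (w dot x))
// Subgradient w.r.t. w: 0 if the margin >= 1, otherwise -y * x.
def hingeSubgradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
  val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
  if (margin >= 1.0) Array.fill(w.length)(0.0)
  else x.map(xi => -y * xi)
}
{code}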

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13179) pyspark row name collision 'count'

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-13179.
--
Resolution: Won't Fix

> pyspark row name collision 'count'
> --
>
> Key: SPARK-13179
> URL: https://issues.apache.org/jira/browse/SPARK-13179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: David Fagnan
>
> The following example from the documentation results in a name collision:
> {code:none}
> >>>  df = sc.parallelize([ Row(name='Alice', age=5, height=80), 
> >>> Row(name='Alice', age=10, height=140)]).toDF()
> >>> alice_counts = df.groupby(df.name).count().collect()
> >>> print(alice_counts[0])
> Row(name=u'Alice',count=2)
> >>> print(alice_counts[0].name)
> Alice
> {code}
> Which is correct, but the column name count results in the name collision 
> below:
> {code:none}
> >>> print(alice_counts[0].count)
> 
> {code}
> The collision results from the inherited method count from python tuples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14732) spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian

2016-04-19 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14732:
-

 Summary: spark.ml GaussianMixture should not use spark.mllib 
MultivariateGaussian
 Key: SPARK-14732
 URL: https://issues.apache.org/jira/browse/SPARK-14732
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


{{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} currently 
returns the {{MultivariateGaussian}} type from spark.mllib.  We should copy the 
MultivariateGaussian class into spark.ml to avoid referencing spark.mllib types 
publicly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13745) Support columnar in memory representation on Big Endian platforms

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248221#comment-15248221
 ] 

Apache Spark commented on SPARK-13745:
--

User 'robbinspg' has created a pull request for this issue:
https://github.com/apache/spark/pull/12501

> Support columnar in memory representation on Big Endian platforms
> -
>
> Key: SPARK-13745
> URL: https://issues.apache.org/jira/browse/SPARK-13745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tim Preece
>  Labels: big-endian
>
> SPARK-12785 introduced a columnar in memory representation. 
> Currently this feature is explicitly supported only on Little Endian 
> platforms. On Big Endian platforms the following exception is thrown:
> "org.apache.commons.lang.NotImplementedException: Only little endian is 
> supported."
> This JIRA should be used to extend support to Big Endian architectures, and 
> decide whether the "in memory" columnar format should be consistent with 
> parquet format.
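For illustration, the kind of endianness handling this implies, using plain java.nio 
(a sketch of the idea only, not the approach taken in the linked pull request):

{code}
import java.nio.{ByteBuffer, ByteOrder}

// Read a 4-byte int that was written in little-endian layout, independent of
// the platform's native byte order.
def readIntLE(bytes: Array[Byte], offset: Int): Int =
  ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt

// On the platforms this JIRA targets, nativeOrder() returns BIG_ENDIAN.
val isBigEndian = ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN
{code}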



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14564) Python Word2Vec missing setWindowSize method

2016-04-19 Thread Brad Willard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248230#comment-15248230
 ] 

Brad Willard commented on SPARK-14564:
--

Do you think it's possible to get this into the 1.6.2 release as well, since it's 
minor?

> Python Word2Vec missing setWindowSize method
> 
>
> Key: SPARK-14564
> URL: https://issues.apache.org/jira/browse/SPARK-14564
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
> Environment: pyspark
>Reporter: Brad Willard
>Assignee: Jason C Lee
>Priority: Minor
>  Labels: ml, pyspark, python, word2vec
> Fix For: 2.0.0
>
>
> The setWindowSize method for configuring the Word2Vec model is available in 
> Scala but missing in Python, so you're stuck with the default window of 5.
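For reference, the Scala side looks roughly like this (illustrative usage; only the 
{{setWindowSize}} call is the point here):

{code}
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(100)
  .setWindowSize(10)   // the setter that the Python wrapper does not yet expose
{code}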



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore

2016-04-19 Thread Vijay Parmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248241#comment-15248241
 ] 

Vijay Parmar commented on SPARK-13662:
--

Thank you Evan! 

If possible, could you please guide me to some study material which you think 
might be helpful in this regard?

Thanks
Vijay


> [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore 
> --
>
> Key: SPARK-13662
> URL: https://issues.apache.org/jira/browse/SPARK-13662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: All
>Reporter: Evan Chan
>
> Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or 
> equivalently the HiveContext.tables method, returns a DataFrame with only two 
> columns: the name of the table and whether it is temporary.  It would be 
> really nice to add support to return some extra information, such as:
> - Whether this table is Spark-only or a native Hive table
> - If spark-only, the name of the data source
> - potentially other properties
> The first two are really useful for BI environments that connect to multiple 
> data sources and work with both Hive and Spark.
> Some thoughts:
> - The SQL/HiveContext Catalog API might need to be expanded to return 
> something like a TableEntry, rather than just a tuple of (name, temporary).
> - I believe there is a Hive Catalog/client API to get information about each 
> table.  I suppose one concern would be the speed of using this API.  Perhaps 
> there are other APIs that can get this info faster.
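One possible shape for the richer catalog entry suggested above (purely illustrative; 
the field set is an assumption, not a proposal that has been agreed on):

{code}
// Hypothetical replacement for the (name, isTemporary) tuple returned today.
case class TableEntry(
    name: String,
    isTemporary: Boolean,
    isSparkOnly: Boolean,            // Spark data source table vs. native Hive table
    dataSource: Option[String],      // e.g. Some("parquet"); None for native Hive tables
    properties: Map[String, String] = Map.empty)
{code}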



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14732) spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14732:
--
Description: 
{{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} currently 
returns the {{MultivariateGaussian}} type from spark.mllib.  We should copy the 
MultivariateGaussian class into spark.ml to avoid referencing spark.mllib types 
publicly.

I'll put it in mllib-local under 
{{spark.ml.stat.distribution.MultivariateGaussian}}.

  was:{{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} 
currently returns the {{MultivariateGaussian}} type from spark.mllib.  We 
should copy the MultivariateGaussian class into spark.ml to avoid referencing 
spark.mllib types publicly.


> spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian
> 
>
> Key: SPARK-14732
> URL: https://issues.apache.org/jira/browse/SPARK-14732
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> {{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} currently 
> returns the {{MultivariateGaussian}} type from spark.mllib.  We should copy 
> the MultivariateGaussian class into spark.ml to avoid referencing spark.mllib 
> types publicly.
> I'll put it in mllib-local under 
> {{spark.ml.stat.distribution.MultivariateGaussian}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore

2016-04-19 Thread Vijay Parmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248241#comment-15248241
 ] 

Vijay Parmar edited comment on SPARK-13662 at 4/19/16 5:27 PM:
---

Thank you Evan! 

If possible, can you please guide me to some study material/blog/site 
which you think might be helpful in this regard?

Thanks
Vijay



was (Author: vsparmar):
Thank you Evan! 

If possible then you please guide me to some study material which you think 
might be helpful in this regard.

Thanks
Vijay


> [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore 
> --
>
> Key: SPARK-13662
> URL: https://issues.apache.org/jira/browse/SPARK-13662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: All
>Reporter: Evan Chan
>
> Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or 
> equivalently the HiveContext.tables method, returns a DataFrame with only two 
> columns: the name of the table and whether it is temporary.  It would be 
> really nice to add support to return some extra information, such as:
> - Whether this table is Spark-only or a native Hive table
> - If spark-only, the name of the data source
> - potentially other properties
> The first two are really useful for BI environments that connect to multiple 
> data sources and work with both Hive and Spark.
> Some thoughts:
> - The SQL/HiveContext Catalog API might need to be expanded to return 
> something like a TableEntry, rather than just a tuple of (name, temporary).
> - I believe there is a Hive Catalog/client API to get information about each 
> table.  I suppose one concern would be the speed of using this API.  Perhaps 
> there are other APIs that can get this info faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12457) Add ExpressionDescription to collection functions

2016-04-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12457.
-
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> Add ExpressionDescription to collection functions
> -
>
> Key: SPARK-12457
> URL: https://issues.apache.org/jira/browse/SPARK-12457
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14676) Catch, wrap, and re-throw exceptions from Await.result in order to capture full stacktrace

2016-04-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14676.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Catch, wrap, and re-throw exceptions from Await.result in order to capture 
> full stacktrace
> --
>
> Key: SPARK-14676
> URL: https://issues.apache.org/jira/browse/SPARK-14676
> Project: Spark
>  Issue Type: Bug
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> When {{Await.result}} throws an exception which originated from a different 
> thread, the resulting stacktrace doesn't include the path leading to the 
> {{Await.result()}} call itself, making it difficult to identify the impact of 
> these exceptions. For example, I've seen cases where broadcast cleaning 
> errors propagate to the main thread and crash it but the resulting stacktrace 
> doesn't include any of the main thread's code, making it difficult to 
> pinpoint which exception crashed that thread. 
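The pattern being described, as a standalone sketch (not Spark's actual helper, just 
the general idea of wrapping to capture the blocking thread's stack):

{code}
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.Duration
import scala.util.control.NonFatal

def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T =
  try {
    Await.result(awaitable, atMost)
  } catch {
    case NonFatal(e) =>
      // Wrapping in a new exception records the stack of the thread that blocked
      // here, in addition to the originating thread's stack carried by the cause.
      throw new RuntimeException("Exception thrown while awaiting result", e)
  }
{code}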



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14676) Catch, wrap, and re-throw exceptions from Await.result in order to capture full stacktrace

2016-04-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248257#comment-15248257
 ] 

Reynold Xin commented on SPARK-14676:
-

[~joshrosen]  I didn't merge this in 1.6.2 because it is a bit too big to 
backport. Let me know if you want to create a smaller patch for 1.6.


> Catch, wrap, and re-throw exceptions from Await.result in order to capture 
> full stacktrace
> --
>
> Key: SPARK-14676
> URL: https://issues.apache.org/jira/browse/SPARK-14676
> Project: Spark
>  Issue Type: Bug
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> When {{Await.result}} throws an exception which originated from a different 
> thread, the resulting stacktrace doesn't include the path leading to the 
> {{Await.result()}} call itself, making it difficult to identify the impact of 
> these exceptions. For example, I've seen cases where broadcast cleaning 
> errors propagate to the main thread and crash it but the resulting stacktrace 
> doesn't include any of the main thread's code, making it difficult to 
> pinpoint which exception crashed that thread. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema

2016-04-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14566.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12179
[https://github.com/apache/spark/pull/12179]

> When appending to partitioned persisted table, we should apply a projection 
> over input query plan using existing metastore schema
> -
>
> Key: SPARK-14566
> URL: https://issues.apache.org/jira/browse/SPARK-14566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> Take the following snippets slightly modified from test case 
> "SQLQuerySuite.SPARK-11453: append data to partitioned table" as an example:
> {code}
> val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j")
> df1.write.partitionBy("i").saveAsTable("tbl11453")
> val df2 = Seq("3" -> "30").toDF("i", "j")
> df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453")
> {code}
> Although {{df1.schema}} is {{<i, j>}}, the schema of the persisted 
> table {{tbl11453}} is actually {{<j, i>}} because {{i}} is a 
> partition column, which is always appended after all data columns. Thus, when 
> appending {{df2}}, the schemata of {{df2}} and the persisted table {{tbl11453}} are 
> actually different.
> In current master branch, {{CreateMetastoreDataSourceAsSelect}} simply 
> applies existing metastore schema to the input query plan ([see 
> here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]),
>  which is wrong. A projection should be used instead to adjust column order 
> here.
> In branch-1.6, [this projection is added in 
> {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104],
>  but was removed in Spark 2.0. Replacing the aforementioned line in 
> {{CreateMetastoreDataSourceAsSelect}} with a projection would be 
> preferable.
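In DataFrame terms, the intended effect is simply to reorder the incoming columns to 
the metastore layout before appending, e.g. (illustrative only, reusing {{df2}} from 
the snippet above; the real fix operates on the logical plan inside 
{{CreateMetastoreDataSourceAsSelect}}):

{code}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Align df2 with the persisted layout: data columns first, partition column last.
val aligned = df2.select(Seq("j", "i").map(col): _*)
aligned.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453")
{code}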



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14458) Wrong data schema is passed to FileFormat data sources that can't infer schema

2016-04-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14458.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12179
[https://github.com/apache/spark/pull/12179]

> Wrong data schema is passed to FileFormat data sources that can't infer schema
> --
>
> Key: SPARK-14458
> URL: https://issues.apache.org/jira/browse/SPARK-14458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> When instantiating a {{FileFormat}} data source that is not able to infer its 
> schema from data files, {{DataSource}} passes the full schema including 
> partition columns to {{HadoopFsRelation}}. We should filter out partition 
> columns and only preserve data columns that actually live in data files.
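The filtering in question is essentially this (a sketch of the idea; the actual change 
lives in {{DataSource}} and the helper name is an assumption):

{code}
import org.apache.spark.sql.types.StructType

// Keep only the columns that are physically present in the data files.
def dataSchema(fullSchema: StructType, partitionColumns: Seq[String]): StructType =
  StructType(fullSchema.filterNot(f => partitionColumns.contains(f.name)))
{code}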



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14414) Make error messages consistent across DDLs

2016-04-19 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248265#comment-15248265
 ] 

Bo Meng commented on SPARK-14414:
-

Can anyone update the 'Assignee' for this one, since my code was already merged 
in? If there is still something left that I can work on, please advise. Thanks!

> Make error messages consistent across DDLs
> --
>
> Key: SPARK-14414
> URL: https://issues.apache.org/jira/browse/SPARK-14414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There are many different error messages right now when the user tries to run 
> something that's not supported. We might throw AnalysisException or 
> ParseException or NoSuchFunctionException etc. We should make all of these 
> consistent before 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248272#comment-15248272
 ] 

Felix Cheung commented on SPARK-12148:
--

I'm up for this if you haven't started on it yet, [~sunrui].

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14730) Expose ColumnPruner as feature transformer

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14730:
--
Affects Version/s: (was: 2.0.0)

> Expose ColumnPruner as feature transformer
> --
>
> Key: SPARK-14730
> URL: https://issues.apache.org/jira/browse/SPARK-14730
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Jacek Laskowski
>Priority: Minor
>
> From d...@spark.apache.org:
> {quote}
> Jacek:
> Came across `private class ColumnPruner` with "TODO(ekl) make this a
> public transformer" in scaladoc, cf.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L317.
> Why is this private and is there a JIRA for the TODO(ekl)?
> {quote}
> {quote}
> Yanbo Liang:
> This is because ColumnPruner is currently only used by RFormula, so we did not 
> expose it as a feature transformer.
> Please feel free to create JIRA and work on it.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14730) Expose ColumnPruner as feature transformer

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14730:
--
Issue Type: New Feature  (was: Improvement)

> Expose ColumnPruner as feature transformer
> --
>
> Key: SPARK-14730
> URL: https://issues.apache.org/jira/browse/SPARK-14730
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Jacek Laskowski
>Priority: Minor
>
> From d...@spark.apache.org:
> {quote}
> Jacek:
> Came across `private class ColumnPruner` with "TODO(ekl) make this a
> public transformer" in scaladoc, cf.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L317.
> Why is this private and is there a JIRA for the TODO(ekl)?
> {quote}
> {quote}
> Yanbo Liang:
> This is because ColumnPruner is currently only used by RFormula, so we did not 
> expose it as a feature transformer.
> Please feel free to create JIRA and work on it.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14675) ClassFormatError in codegen when using Aggregator

2016-04-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14675.
--
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.0.0

This issue has been resolved by https://github.com/apache/spark/pull/12468.

> ClassFormatError in codegen when using Aggregator
> -
>
> Key: SPARK-14675
> URL: https://issues.apache.org/jira/browse/SPARK-14675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> code:
> {noformat}
>   val toList = new Aggregator[(String, Int), Seq[Int], Seq[Int]] {
> def bufferEncoder: Encoder[Seq[Int]] = implicitly[Encoder[Seq[Int]]]
> def finish(reduction: Seq[Int]): Seq[Int] = reduction
> def merge(b1: Seq[Int],b2: Seq[Int]): Seq[Int] = b1 ++ b2
> def outputEncoder: Encoder[Seq[Int]] = implicitly[Encoder[Seq[Int]]]
> def reduce(b: Seq[Int],a: (String, Int)): Seq[Int] = b :+ a._2
> def zero: Seq[Int] = Seq.empty[Int]
>   }
>   val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
>   val ds2 = ds1.groupByKey(_._1).agg(toList.toColumn)
>   ds2.show
> {noformat}
> this gives me:
> {noformat}
> 6/04/15 18:31:22 WARN TaskSetManager: Lost task 1.0 in stage 3.0 (TID 7, 
> localhost): java.lang.ClassFormatError: Duplicate field name&signature in 
> class file 
> org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificMutableProjection
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> org.codehaus.janino.ByteArrayClassLoader.findClass(ByteArrayClassLoader.java:66)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass.generate(Unknown 
> Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$create$2.apply(GenerateMutableProjection.scala:140)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$create$2.apply(GenerateMutableProjection.scala:139)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:178)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.<init>(AggregationIterator.scala:197)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregate$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregate.scala:80)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregate$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregate.scala:71)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:768)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:768)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> when i do:
> {noformat}
>  ds2.queryExecution.debug.codegen()
> {noformat}
> i get:
> {noformat}
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> WholeStageCodegen
> :  +- Sort [value#6 ASC], false, 0
> : +- INPUT
> +- AppendColumns , newInstance(class scala.Tuple2), 
> [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, input[0, java.lang.String], true) AS value#6]
>+- LocalTableScan [_1#2,_2#3], 
> [[0,180001,1,61],[0,180001,2,61],[0,180001,3,61]]
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */ 
> /* 005 */ /** Codegened pipeline for:
> /* 006 */ * Sort [value#6 ASC], false, 0
> /* 00

[jira] [Commented] (SPARK-14051) Implement `Double.NaN==Float.NaN` for consistency

2016-04-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248284#comment-15248284
 ] 

Dongjoon Hyun commented on SPARK-14051:
---

Hi, [~joshrosen].
Could you take a look at this PR?
Sorry for the duplicated messages.
I left some comments in the PR explaining why I'm asking you.

> Implement `Double.NaN==Float.NaN` for consistency
> -
>
> Key: SPARK-14051
> URL: https://issues.apache.org/jira/browse/SPARK-14051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The 
> only exception case is direct comparison between  `Row(Float.NaN)` and 
> `Row(Double.NaN)`. The following example illustrates this: the last two expressions 
> should be *true* and *List([NaN])* for consistency.
> {code}
> scala> 
> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
> scala> sql("select a,b,a=b from tmp").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
> scala> val row_a = sql("select a from tmp").collect()
> row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> val row_b = sql("select b from tmp").collect()
> row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> row_a(0) == row_b(0)
> res2: Boolean = true
> scala> List(row_a(0),row_b(0)).distinct
> res3: List[org.apache.spark.sql.Row] = List([1.0])
> scala> row_a(1) == row_b(1)
> res4: Boolean = false
> scala> List(row_a(1),row_b(1)).distinct
> res5: List[org.apache.spark.sql.Row] = List([NaN], [NaN])
> {code}
> Please note the following background truths as of today.
> * Double.NaN != Double.NaN (Scala/Java/IEEE Standard)
> * Float.NaN != Float.NaN (Scala/Java/IEEE Standard)
> * Double.NaN != Float.NaN (Scala/Java/IEEE Standard)
> * Row(Double.NaN) == Row(Double.NaN)
> * Row(Float.NaN) == Row(Float.NaN)
> * *Row(Double.NaN) != Row(Float.NaN)*  <== The problem of this issue.
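In plain Scala, the behaviour being asked for amounts to a NaN-aware comparison after 
widening to double (a sketch of the requested semantics, not Spark's internal code):

{code}
// NaN-aware equality: treats any two NaNs as equal, otherwise falls back to ==.
def nanAwareEquals(a: Double, b: Double): Boolean =
  (a.isNaN && b.isNaN) || a == b

nanAwareEquals(Double.NaN, Float.NaN.toDouble)  // true  -- what Row comparison should yield
Double.NaN == Float.NaN                         // false -- plain Scala/IEEE semantics
{code}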



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14326) Can't specify "long" type in structField

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248335#comment-15248335
 ] 

Shivaram Venkataraman commented on SPARK-14326:
---

I think it's fine to support this in StructField / other schema specifications 
that we use to pass information to Scala. I don't think adding an int64 
dependency is necessarily something we should do, as per the discussion in 
https://issues.apache.org/jira/browse/SPARK-12360 

> Can't specify "long" type in structField
> 
>
> Key: SPARK-14326
> URL: https://issues.apache.org/jira/browse/SPARK-14326
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Dmitriy Selivanov
>
> I tried `long`, `bigint`, `LongType`, and `Long`. Nothing works...
> {code}
> schema <- structType(structField("id", "long"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14733) Allow custom timing control in microbenchmarks

2016-04-19 Thread Eric Liang (JIRA)
Eric Liang created SPARK-14733:
--

 Summary: Allow custom timing control in microbenchmarks
 Key: SPARK-14733
 URL: https://issues.apache.org/jira/browse/SPARK-14733
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang


The current benchmark framework runs a code block for several iterations and 
reports statistics. However there is no way to exclude per-iteration setup time 
from the overall results.
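The kind of control being asked for, sketched with stand-in setup and workload 
functions (the names below are placeholders, not the Benchmark API):

{code}
val iterations = 10
def buildInput(): Array[Int] = Array.tabulate(1000000)(identity)   // stand-in per-iteration setup
def runWorkload(data: Array[Int]): Long = data.map(_.toLong).sum   // stand-in measured work

var totalNanos = 0L
for (_ <- 0 until iterations) {
  val data = buildInput()                    // excluded from the measurement
  val start = System.nanoTime()
  runWorkload(data)                          // only this part is timed
  totalNanos += System.nanoTime() - start
}
println(s"avg ms/iteration: ${totalNanos / iterations / 1e6}")
{code}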



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14734) Add conversions between mllib and ml Vector, Matrix types

2016-04-19 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14734:
-

 Summary: Add conversions between mllib and ml Vector, Matrix types
 Key: SPARK-14734
 URL: https://issues.apache.org/jira/browse/SPARK-14734
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be 
useful to have {{private[spark]}} methods for converting from one linear 
algebra representation to another.  I am running into this issue in 
[SPARK-14732].
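A sketch of what such converters could look like for vectors (the object name, package, 
and {{toML}} are assumptions here, not the final API):

{code}
package org.apache.spark.ml.util   // hypothetical location

import org.apache.spark.ml.{linalg => newlinalg}
import org.apache.spark.mllib.{linalg => oldlinalg}

private[spark] object LinalgConversions {
  /** Convert a spark.mllib vector to its spark.ml counterpart. */
  def toML(v: oldlinalg.Vector): newlinalg.Vector = v match {
    case dv: oldlinalg.DenseVector  => newlinalg.Vectors.dense(dv.values)
    case sv: oldlinalg.SparseVector => newlinalg.Vectors.sparse(sv.size, sv.indices, sv.values)
  }
}
{code}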



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14733) Allow custom timing control in microbenchmarks

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14733:


Assignee: (was: Apache Spark)

> Allow custom timing control in microbenchmarks
> --
>
> Key: SPARK-14733
> URL: https://issues.apache.org/jira/browse/SPARK-14733
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>
> The current benchmark framework runs a code block for several iterations and 
> reports statistics. However there is no way to exclude per-iteration setup 
> time from the overall results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14733) Allow custom timing control in microbenchmarks

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14733:


Assignee: Apache Spark

> Allow custom timing control in microbenchmarks
> --
>
> Key: SPARK-14733
> URL: https://issues.apache.org/jira/browse/SPARK-14733
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> The current benchmark framework runs a code block for several iterations and 
> reports statistics. However there is no way to exclude per-iteration setup 
> time from the overall results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14733) Allow custom timing control in microbenchmarks

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248359#comment-15248359
 ] 

Apache Spark commented on SPARK-14733:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/12502

> Allow custom timing control in microbenchmarks
> --
>
> Key: SPARK-14733
> URL: https://issues.apache.org/jira/browse/SPARK-14733
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>
> The current benchmark framework runs a code block for several iterations and 
> reports statistics. However there is no way to exclude per-iteration setup 
> time from the overall results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248363#comment-15248363
 ] 

Shivaram Venkataraman commented on SPARK-14692:
---

I think SparkR might not have been built? Is this from cloning the source code 
or from downloading a release?

Also, please note that JIRA issues are typically used to track development of 
features/bugs; this question is better suited to the Spark users mailing 
list: http://spark.apache.org/community.html

> Error While Setting the path for R front end
> 
>
> Key: SPARK-14692
> URL: https://issues.apache.org/jira/browse/SPARK-14692
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Mac OSX
>Reporter: Niranjan Molkeri`
>
> Trying to set the environment path for SparkR in RStudio, I get the following 
> error:
> > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> > library(SparkR)
> Error in library(SparkR) : there is no package called ‘SparkR’
> > sc <- sparkR.init(master="local")
> Error: could not find function "sparkR.init"
> In the directory to which it points, there is a directory called SparkR. I 
> don't know how to proceed with this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14325) some strange name conflicts in `group_by`

2016-04-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248374#comment-15248374
 ] 

Shivaram Venkataraman commented on SPARK-14325:
---

I think the problem might be related to `x` being the name of the GroupedData 
argument to `agg`. Does your code work if you replace `x = "sum"` with some 
other name?
{code}
setMethod("agg",
  signature(x = "GroupedData"),
  function(x, ...) {
{code}

> some strange name conflicts in `group_by`
> -
>
> Key: SPARK-14325
> URL: https://issues.apache.org/jira/browse/SPARK-14325
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0, 1.6.1
> Environment: sparkR 1.6.0
>Reporter: Dmitriy Selivanov
>
> group_by behaves strangely when trying to aggregate by a column named "x".
> Consider the following example:
> {code}
> df
> # DataFrame[userId:bigint, type:string, j:int, x:int]
> df %>%group_by(df$userId, df$type, df$j) %>% agg(x = "sum")
> #Error in (function (classes, fdef, mtable)  : 
> #  unable to find an inherited method for function ‘agg’ for signature 
> ‘"character"’
> {code}
> After renaming x -> x2, it works just fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


