[jira] [Assigned] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update
[ https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12570: Assignee: Apache Spark > DecisionTreeRegressor: provide variance of prediction: user guide update > > > Key: SPARK-12570 > URL: https://issues.apache.org/jira/browse/SPARK-12570 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > See linked JIRA for details. This should update the table of output columns > and text. Examples are probably not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update
[ https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082631#comment-15082631 ] Apache Spark commented on SPARK-12570: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10594 > DecisionTreeRegressor: provide variance of prediction: user guide update > > > Key: SPARK-12570 > URL: https://issues.apache.org/jira/browse/SPARK-12570 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Minor > > See linked JIRA for details. This should update the table of output columns > and text. Examples are probably not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update
[ https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12570: Assignee: (was: Apache Spark) > DecisionTreeRegressor: provide variance of prediction: user guide update > > > Key: SPARK-12570 > URL: https://issues.apache.org/jira/browse/SPARK-12570 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Minor > > See linked JIRA for details. This should update the table of output columns > and text. Examples are probably not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082639#comment-15082639 ] Elazar Gershuni commented on SPARK-12623: - That does not answer the question/feature request. Mapping values to values can be achieved by code similar to the one you suggested: rdd.map { case (key, value) => (key, myFunctionOf(value)) } Yet Spark does provide rdd.mapValues(), for performance reasons (retaining the partitioning - avoiding the need to reshuffle when the key does not change). I would like to enjoy similar benefits for my case too. The code that you suggested does not, since Spark cannot know that the key does not change. I'm sorry if that's not the place for the question/feature request, but it really isn't a user question. > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the function passed to mapValues() take a key as an argument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyzes it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
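The trade-off described in the comment above can be sketched outside Spark with a toy model (ToyRDD and its String partitioner tag are hypothetical, not Spark API): map() may rewrite keys, so any partitioner must be discarded, while mapValues() provably leaves keys alone and can keep it.

```scala
// A toy model (not Spark) of why mapValues can retain the partitioner:
// map() may change keys, so the partitioner must be discarded;
// mapValues() cannot change keys, so the partitioner can be kept.
case class ToyRDD[K, V](data: Seq[(K, V)], partitioner: Option[String]) {
  def map[K2, V2](f: ((K, V)) => (K2, V2)): ToyRDD[K2, V2] =
    ToyRDD(data.map(f), None)                                  // keys may have changed
  def mapValues[V2](f: V => V2): ToyRDD[K, V2] =
    ToyRDD(data.map { case (k, v) => (k, f(v)) }, partitioner) // keys unchanged
}

val rdd = ToyRDD(Seq(1 -> 10, 2 -> 20), Some("hash"))
val viaMap       = rdd.map { case (k, v) => (k, v + 1) }
val viaMapValues = rdd.mapValues(_ + 1)
// viaMap.partitioner == None; viaMapValues.partitioner == Some("hash")
```

In real Spark the same contrast shows up as rdd.mapValues(f).partitioner being defined while rdd.map(g).partitioner is None, even when g never touches the key.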
[jira] [Assigned] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12644: Assignee: Nong Li (was: Apache Spark) > Vectorize/Batch decode parquet > -- > > Key: SPARK-12644 > URL: https://issues.apache.org/jira/browse/SPARK-12644 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > > The parquet encodings are largely designed to decode faster in batches, > column by column. This can speed up the decoding considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12644: Assignee: Apache Spark (was: Nong Li) > Vectorize/Batch decode parquet > -- > > Key: SPARK-12644 > URL: https://issues.apache.org/jira/browse/SPARK-12644 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Apache Spark > > The parquet encodings are largely designed to decode faster in batches, > column by column. This can speed up the decoding considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082629#comment-15082629 ] Apache Spark commented on SPARK-12644: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/10593 > Vectorize/Batch decode parquet > -- > > Key: SPARK-12644 > URL: https://issues.apache.org/jira/browse/SPARK-12644 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > > The parquet encodings are largely designed to decode faster in batches, > column by column. This can speed up the decoding considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
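The batch-decoding idea behind this ticket can be illustrated with a toy run-length-encoded column (an illustration of the principle only, not Spark's or Parquet's actual decoder): filling the output a whole run at a time amortizes the per-value overhead that a value-at-a-time reader pays.

```scala
// Toy RLE column: a Seq of (runLength, value) pairs. Batch decoding
// fills the output array run by run instead of producing one value
// per call, which is the overhead vectorized readers avoid.
def decodeBatch(runs: Seq[(Int, Int)]): Array[Int] = {
  val out = new Array[Int](runs.map(_._1).sum)
  var pos = 0
  for ((len, value) <- runs) {
    java.util.Arrays.fill(out, pos, pos + len, value)
    pos += len
  }
  out
}

val decoded = decodeBatch(Seq((3, 7), (2, 0)))
// decoded.toSeq == Seq(7, 7, 7, 0, 0)
```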
[jira] [Commented] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082739#comment-15082739 ] Sean Owen commented on SPARK-12623: --- There is a {{preservesPartitioning}} flag on some API methods that lets you specify that your function of {{(key, value)}} pairs won't change keys, or at least won't change the partitioning. Unfortunately, for historical reasons this wasn't exposed on the {{map()}} function, but was exposed on {{mapPartitions}}. It's a little clunky to invoke if you only need map, but not much -- you get an iterator that you then map as before. That would at least let you do what you're trying to do. As to exposing a specialized method for this, yeah it's not crazy or anything but I doubt it would be viewed as worth it when there's a fairly direct way to do what you want. (Or else, I'd say argue for a new param to map, but that has its own obscure issues.) > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the function passed to mapValues() take a key as an argument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyzes it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
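A sketch of the workaround described above (the mapPartitions signature with its preservesPartitioning flag is real Spark API; the pair type and the keepKeys helper are illustrative). Only the per-partition iterator function is shown runnable here, with the Spark invocation in a comment so the sketch stays self-contained:

```scala
// The per-partition function: maps (key, value) pairs without touching keys.
// "You get an iterator that you then map as before."
def keepKeys(iter: Iterator[(Int, String)]): Iterator[(Int, String)] =
  iter.map { case (k, v) => (k, v.toUpperCase) }

// With Spark this would be invoked as (the flag defaults to false):
//   rdd.mapPartitions(keepKeys, preservesPartitioning = true)

val out = keepKeys(Iterator(1 -> "a", 2 -> "b")).toList
// out == List(1 -> "A", 2 -> "B")
```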
[jira] [Commented] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082764#comment-15082764 ] Vijay Kiran commented on SPARK-12630: - I've made the changes; after I run the tests, I'll open a PR. > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12645: Description: Add hash function for SparkR (was: SparkR add function hash for DataFrame) > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12645: Assignee: Apache Spark > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12645: Assignee: (was: Apache Spark) > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12646) Support _HOST in kerberos principal for connecting to secure cluster
Hari Krishna Dara created SPARK-12646: - Summary: Support _HOST in kerberos principal for connecting to secure cluster Key: SPARK-12646 URL: https://issues.apache.org/jira/browse/SPARK-12646 Project: Spark Issue Type: Improvement Components: YARN Reporter: Hari Krishna Dara Priority: Minor Hadoop supports _HOST as a token that is dynamically replaced with the actual hostname at the time Kerberos authentication is done. This is supported in many Hadoop stacks, including YARN. When configuring Spark to connect to a secure cluster (e.g., yarn-cluster or yarn-client as master), it would be natural to extend support for this token to Spark as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
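For reference, Hadoop expands the token by substituting the machine's canonical hostname into the principal before authenticating (SecurityUtil.getServerPrincipal is the Hadoop entry point). A minimal sketch of that substitution, with resolvePrincipal as a hypothetical helper rather than Spark or Hadoop code:

```scala
// Replace the _HOST token in a Kerberos principal with the actual
// hostname, mimicking Hadoop's SecurityUtil.getServerPrincipal
// (which also lowercases the resolved hostname).
def resolvePrincipal(principal: String, hostname: String): String =
  principal.replace("_HOST", hostname.toLowerCase)

val resolved = resolvePrincipal("spark/_HOST@EXAMPLE.COM", "Node1.cluster.local")
// resolved == "spark/node1.cluster.local@EXAMPLE.COM"
```

In a live system the hostname argument would come from the local machine (e.g. InetAddress.getLocalHost.getCanonicalHostName); it is passed explicitly here to keep the sketch deterministic.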
[jira] [Resolved] (SPARK-12641) Remove unused code related to Hadoop 0.23
[ https://issues.apache.org/jira/browse/SPARK-12641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12641. - Resolution: Fixed Assignee: Kousuke Saruta Fix Version/s: 2.0.0 > Remove unused code related to Hadoop 0.23 > - > > Key: SPARK-12641 > URL: https://issues.apache.org/jira/browse/SPARK-12641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 2.0.0 > > > Currently we don't support Hadoop 0.23 but there is a few code related to it > so let's clean it up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it
[ https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082653#comment-15082653 ] Adrian Bridgett commented on SPARK-12622: - Ajesh - that'd be a good improvement (I raised the ticket because it's not obvious what the problem is, rather than because I really want spaces to work!) I'd worry that someone would then raise a problem about "file:/tmp/f%20oo.jar" failing :-) Jayadevan - I disliked the space when I saw it (sbt assembly of some in-house code) but didn't know if it was invalid or not (but made a mental note to ask if we could lose the space). FYI, it looks like it's due to the name in our sbt build being "foo data", so we get "foo data-assembly-1.0.jar". Interestingly, the sbt example also has spaces: http://www.scala-sbt.org/0.13/docs/Howto-Project-Metadata.html > spark-submit fails on executors when jar has a space in it > -- > > Key: SPARK-12622 > URL: https://issues.apache.org/jira/browse/SPARK-12622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 > Environment: Linux, Mesos >Reporter: Adrian Bridgett >Priority: Minor > > spark-submit --class foo "Foo.jar" works > but when using "f oo.jar" it starts to run and then breaks on the executors > as they cannot find the various functions. > Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this > fails immediately. > {noformat} > spark-submit --class Foo /tmp/f\ oo.jar > ...
> spark.jars=file:/tmp/f%20oo.jar > 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at > http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769 > 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 > (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: > Foo$$anonfun$46 > {noformat} > SPARK-6568 is related but maybe specific to the Windows environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
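Until spaces are either supported or rejected with a clearer error, a workaround on the build side is to pin the artifact name explicitly. A sketch assuming the sbt-assembly plugin (assemblyJarName is sbt-assembly's setting key; the project name and jar name here are hypothetical):

```scala
// build.sbt (fragment) -- keep the artifact file name free of spaces
// even when the project's display name contains one.
name := "foo data"

// sbt-assembly setting overriding the default "<name>-assembly-<version>.jar"
assemblyJarName in assembly := "foo-data-assembly-1.0.jar"
```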
[jira] [Commented] (SPARK-12401) Add support for enums in postgres
[ https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082659#comment-15082659 ] Apache Spark commented on SPARK-12401: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/10596 > Add support for enums in postgres > - > > Key: SPARK-12401 > URL: https://issues.apache.org/jira/browse/SPARK-12401 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar >Priority: Minor > > JSON and JSONB types [are now > converted|https://github.com/apache/spark/pull/8948/files] into strings on > the Spark side instead of throwing. It would be great if [enumerated > types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were > treated similarly instead of failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12401) Add support for enums in postgres
[ https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12401: Assignee: (was: Apache Spark) > Add support for enums in postgres > - > > Key: SPARK-12401 > URL: https://issues.apache.org/jira/browse/SPARK-12401 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar >Priority: Minor > > JSON and JSONB types [are now > converted|https://github.com/apache/spark/pull/8948/files] into strings on > the Spark side instead of throwing. It would be great if [enumerated > types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were > treated similarly instead of failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082687#comment-15082687 ] Lunen commented on SPARK-12403: --- I've managed to get in contact with the people who develop the Spark ODBC drivers. They told me that they OEM the driver to Databricks and that they don't understand why Databricks would not make the latest driver available. I've also tested a trial version of the developer's latest driver and it works perfectly fine. I asked on Databricks' forum and sent emails to their sales and info departments explaining the situation. Hopefully someone can help. > "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore > > > Key: SPARK-12403 > URL: https://issues.apache.org/jira/browse/SPARK-12403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: ODBC connector query >Reporter: Lunen > > We are unable to query the SPARK tables using the ODBC driver from Simba > Spark (Databricks - "Simba Spark ODBC Driver 1.0"). We are able to do a show > databases and show tables, but not any queries, e.g. > Working: > Select * from openquery(SPARK,'SHOW DATABASES') > Select * from openquery(SPARK,'SHOW TABLES') > Not working: > Select * from openquery(SPARK,'Select * from lunentest') > The error I get is: > OLE DB provider "MSDASQL" for linked server "SPARK" returned message > "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest". > Msg 7321, Level 16, State 2, Line 2 > An error occurred while preparing the query "Select * from lunentest" for > execution against OLE DB provider "MSDASQL" for linked server "SPARK" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12645) SparkR add function hash
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12645: Summary: SparkR add function hash (was: SparkR add function hash) > SparkR add function hash > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > SparkR add function hash for DataFrame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12645: Summary: SparkR support hash function (was: SparkR add function hash ) > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > SparkR add function hash for DataFrame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082639#comment-15082639 ] Elazar Gershuni edited comment on SPARK-12623 at 1/5/16 8:41 AM: - That does not answer the question/feature request. Mapping values to values can be achieved by code similar to the one you suggested: {code} rdd.map { case (key, value) => (key, myFunctionOf(value)) } {code} Yet Spark does provide {{rdd.mapValues()}}, for performance reasons (retaining the partitioning - avoiding the need to reshuffle when the key does not change). I would like to enjoy similar benefits for my case too. The code that you suggested does not, since Spark cannot know that the key does not change. I'm sorry if that's not the place for the question/feature request, but it really isn't a user question. was (Author: elazar): That does not answer the question/feature request. Mapping values to values can be achieved by similar code to the one you suggested: rdd.map { case (key, value) => (key, myFunctionOf(value)) } Yet Spark does provide rdd.mapValues(), for performance reasons (retaining the partitioning - avoiding the need to reshuffle when the key does not change). I would like to enjoy similar benefits for my case too. The code that you suggested does not, since spark cannot know that the key does not change. I'm sorry if that's not the place for the question/feature request, but it really isn't a user question. > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the function passed to mapValues() take a key as an argument? > Alternatively, can we have a "mapKeyValuesToValues" that does?
> Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyzes it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082643#comment-15082643 ] somil deshmukh commented on SPARK-12632: I would like to work on this > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11806) Spark 2.0 deprecations and removals
[ https://issues.apache.org/jira/browse/SPARK-11806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11806: Description: This is an umbrella ticket to track things we are deprecating and removing in Spark 2.0. was: This is an umbrella ticket to track things we are deprecating and removing in Spark 2.0. All sub-tasks are currently assigned to Reynold to prevent others from picking up prematurely. > Spark 2.0 deprecations and removals > --- > > Key: SPARK-11806 > URL: https://issues.apache.org/jira/browse/SPARK-11806 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > This is an umbrella ticket to track things we are deprecating and removing in > Spark 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082674#comment-15082674 ] Mario Briggs commented on SPARK-12177: -- Implemented here: https://github.com/mariobriggs/spark/commit/2fcbb721b99b48e336ba7ef7c317c279c9483840 > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that > is not compatible with the old one. So, I added the new consumer API. I made > separate classes in package org.apache.spark.streaming.kafka.v09 with the > changed API. I didn't remove the old classes, for backward compatibility. > Users will not need to change their old Spark applications when they upgrade > to the new Spark version. > Please review my changes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12403. --- Resolution: Not A Problem OK, but as far as I can tell from this conversation it's an issue with a third-party ODBC driver. > "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore > > > Key: SPARK-12403 > URL: https://issues.apache.org/jira/browse/SPARK-12403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: ODBC connector query >Reporter: Lunen > > We are unable to query the SPARK tables using the ODBC driver from Simba > Spark(Databricks - "Simba Spark ODBC Driver 1.0") We are able to do a show > databases and show tables, but not any queries. eg. > Working: > Select * from openquery(SPARK,'SHOW DATABASES') > Select * from openquery(SPARK,'SHOW TABLES') > Not working: > Select * from openquery(SPARK,'Select * from lunentest') > The error I get is: > OLE DB provider "MSDASQL" for linked server "SPARK" returned message > "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest". > Msg 7321, Level 16, State 2, Line 2 > An error occurred while preparing the query "Select * from lunentest" for > execution against OLE DB provider "MSDASQL" for linked server "SPARK" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12645) SparkR add function hash
Yanbo Liang created SPARK-12645: --- Summary: SparkR add function hash Key: SPARK-12645 URL: https://issues.apache.org/jira/browse/SPARK-12645 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Yanbo Liang SparkR add function hash for DataFrame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082732#comment-15082732 ] Apache Spark commented on SPARK-12645: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10597 > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
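The function being ported to SparkR already exists in the Scala/Python DataFrame API as org.apache.spark.sql.functions.hash, which is Murmur3-based. As a self-contained illustration of that style of hashing, using the Scala standard library's MurmurHash3 (not byte-for-byte Spark's implementation):

```scala
import scala.util.hashing.MurmurHash3

// Hash each row's string key into an Int, Murmur3-style: deterministic,
// so equal inputs always produce equal hashes.
val rows = Seq("alice", "bob", "alice")
val hashes = rows.map(s => MurmurHash3.stringHash(s, 42))

// hashes(0) == hashes(2), since the first and third inputs are equal
```

A SparkR hash() would expose the same behaviour as a Column expression, e.g. something like select(df, hash(df$name)) once the function is added.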
[jira] [Commented] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
[ https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083142#comment-15083142 ] Nilanjan Raychaudhuri commented on SPARK-7831: -- I am working on a possible fix for this. I will submit a pull request soon > Mesos dispatcher doesn't deregister as a framework from Mesos when stopped > -- > > Key: SPARK-7831 > URL: https://issues.apache.org/jira/browse/SPARK-7831 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.0 > Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source) >Reporter: Luc Bourlier > > To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be > running. > It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher > registers as a framework in the Mesos cluster. > After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the > application is correctly terminated locally, but the framework is still > listed as {{active}} in the Mesos dashboard. > I would expect the framework to be de-registered when the dispatcher is > stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083311#comment-15083311 ] Daniel Darabos commented on SPARK-11293: I have a somewhat contrived example that still leaks in 1.6.0. I started {{spark-shell --master 'local-cluster[2,2,1024]'}} and ran: {code} sc.parallelize(0 to 1000, 2).map(x => x % 1 -> x).groupByKey.asInstanceOf[org.apache.spark.rdd.ShuffledRDD[Int, Int, Iterable[Int]]].setKeyOrdering(implicitly[Ordering[Int]]).mapPartitions { it => it.take(1) }.collect {code} I've added extra logging around task memory acquisition so I would be able to see what is not released. These are the logs: {code} 16/01/05 17:02:45 INFO Executor: Running task 0.0 in stage 13.0 (TID 24) 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Updating epoch to 7 and clearing cache 16/01/05 17:02:45 INFO TorrentBroadcast: Started reading broadcast variable 13 16/01/05 17:02:45 INFO MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 2.3 KB, free 7.6 KB) 16/01/05 17:02:45 INFO TorrentBroadcast: Reading broadcast variable 13 took 6 ms 16/01/05 17:02:45 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 4.5 KB, free 12.1 KB) 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 6, fetching them 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.0.32:55147) 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Got the output locations 16/01/05 17:02:45 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 16/01/05 17:02:45 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms 16/01/05 17:02:45 ERROR TaskMemoryManager: Task 24 acquire 5.0 MB for null 16/01/05 17:02:45 ERROR TaskMemoryManager: Stack trace: java.lang.Exception: here at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:187) at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:82) at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 16/01/05 17:02:47 ERROR TaskMemoryManager: Task 24 acquire 15.0 MB for null 16/01/05 17:02:47 ERROR TaskMemoryManager: Stack trace: java.lang.Exception: here at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:187) at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:82) at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45) at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
[jira] [Created] (SPARK-12651) mllib deprecation messages mention non-existent version 1.7.0
Marcelo Vanzin created SPARK-12651: -- Summary: mllib deprecation messages mention non-existent version 1.7.0 Key: SPARK-12651 URL: https://issues.apache.org/jira/browse/SPARK-12651 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Trivial Might be a problem in 1.6 also? {code} @Since("1.4.0") @deprecated("Support for runs is deprecated. This param will have no effect in 1.7.0.", "1.6.0") def getRuns: Int = runs {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
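For reference, the fix would presumably just bump the advertised removal version to one that actually exists; a sketch of the corrected fragment (assuming 2.0.0 is the intended version, which this ticket does not confirm):

```scala
// Hypothetical corrected fragment of mllib's KMeans.getRuns; the removal
// version must reference an actual planned release (2.0.0 is an assumption).
@Since("1.4.0")
@deprecated("Support for runs is deprecated. This param will have no effect in 2.0.0.", "1.6.0")
def getRuns: Int = runs
```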
[jira] [Commented] (SPARK-12431) add local checkpointing to GraphX
[ https://issues.apache.org/jira/browse/SPARK-12431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083419#comment-15083419 ] David Youd commented on SPARK-12431: Since localCheckpoint() was partially implemented, but in a way that doesn’t work, should this be changed from “Improvement” to “Bug”? > add local checkpointing to GraphX > - > > Key: SPARK-12431 > URL: https://issues.apache.org/jira/browse/SPARK-12431 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.5.2 >Reporter: Edward Seidl > > local checkpointing was added to RDD to speed up iterative spark jobs, but > this capability hasn't been added to GraphX. Adding localCheckpoint to > GraphImpl, EdgeRDDImpl, and VertexRDDImpl greatly improved the speed of a > k-core algorithm I'm using (at the cost of fault tolerance, of course). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083326#comment-15083326 ] Daniel Darabos commented on SPARK-11293: Sorry, my example was overly complicated. This one triggers the same leak. {code} sc.parallelize(0 to 1000, 2).map(x => x % 1 -> x).groupByKey.mapPartitions { it => it.take(1) }.collect {code} > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083397#comment-15083397 ] kevin yu commented on SPARK-12648: -- I can recreate the problem, I will look into this issue. Thanks. Kevin > UDF with Option[Double] throws ClassCastException > - > > Key: SPARK-12648 > URL: https://issues.apache.org/jira/browse/SPARK-12648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Mikael Valot > > I can write an UDF that returns an Option[Double], and the DataFrame's > schema is correctly inferred to be a nullable double. > However I cannot seem to be able to write a UDF that takes an Option as an > argument: > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkContext, SparkConf} > val conf = new SparkConf().setMaster("local[4]").setAppName("test") > val sc = new SparkContext(conf) > val sqlc = new SQLContext(sc) > import sqlc.implicits._ > val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", > "weight") > import org.apache.spark.sql.functions._ > val addTwo = udf((d: Option[Double]) => d.map(_+2)) > df.withColumn("plusTwo", addTwo(df("weight"))).show() > => > 2016-01-05T14:41:52 Executor task launch worker-0 ERROR > org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option > at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) > ~[na:na] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[na:na] > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51) > ~[spark-sql_2.10-1.6.0.jar:1.6.0] > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49) > ~[spark-sql_2.10-1.6.0.jar:1.6.0] > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > 
~[scala-library-2.10.5.jar:na] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12649: Assignee: (was: Apache Spark) > support reading bucketed table > -- > > Key: SPARK-12649 > URL: https://issues.apache.org/jira/browse/SPARK-12649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083208#comment-15083208 ] Apache Spark commented on SPARK-12649: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10604 > support reading bucketed table > -- > > Key: SPARK-12649 > URL: https://issues.apache.org/jira/browse/SPARK-12649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12649: Assignee: Apache Spark > support reading bucketed table > -- > > Key: SPARK-12649 > URL: https://issues.apache.org/jira/browse/SPARK-12649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12648) UDF with Option[Double] throws ClassCastException
Mikael Valot created SPARK-12648: Summary: UDF with Option[Double] throws ClassCastException Key: SPARK-12648 URL: https://issues.apache.org/jira/browse/SPARK-12648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Mikael Valot I can write an UDF that returns an Option[Double], and the DataFrame's schema is correctly inferred to be a nullable double. However I cannot seem to be able to write a UDF that takes an Option as an argument: import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkContext, SparkConf} val conf = new SparkConf().setMaster("local[4]").setAppName("test") val sc = new SparkContext(conf) val sqlc = new SQLContext(sc) import sqlc.implicits._ val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", "weight") import org.apache.spark.sql.functions._ val addTwo = udf((d: Option[Double]) => d.map(_+2)) df.withColumn("plusTwo", addTwo(df("weight"))).show() => 2016-01-05T14:41:52 Executor task launch worker-0 ERROR org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1) java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) ~[na:na] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[na:na] at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51) ~[spark-sql_2.10-1.6.0.jar:1.6.0] at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49) ~[spark-sql_2.10-1.6.0.jar:1.6.0] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) ~[scala-library-2.10.5.jar:na] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
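Scala UDFs receive a nullable column as a bare (possibly null) boxed value rather than an Option, which is why the cast fails. A commonly cited workaround (a sketch, not taken from this thread; assumes the same {{df}} as above) is to accept {{java.lang.Double}} and handle null explicitly:

```scala
import org.apache.spark.sql.functions.udf

// Sketch of a workaround: take the boxed java.lang.Double, which Spark
// passes as null for missing values, instead of Option[Double].
val addTwo = udf((d: java.lang.Double) =>
  if (d == null) null else java.lang.Double.valueOf(d + 2))

df.withColumn("plusTwo", addTwo(df("weight"))).show()
```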
[jira] [Created] (SPARK-12649) support reading bucketed table
Wenchen Fan created SPARK-12649: --- Summary: support reading bucketed table Key: SPARK-12649 URL: https://issues.apache.org/jira/browse/SPARK-12649 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
John Vines created SPARK-12650: -- Summary: No means to specify Xmx settings for SparkSubmit in yarn-cluster mode Key: SPARK-12650 URL: https://issues.apache.org/jira/browse/SPARK-12650 Project: Spark Issue Type: Bug Affects Versions: 1.5.2 Environment: Hadoop 2.6.0 Reporter: John Vines Background- I have an app master designed to do some work and then launch a spark job. Issue- If I use yarn-cluster mode, the SparkSubmit JVM is launched without any Xmx setting at all, so it takes the default heap size, which is relatively large. This causes a large amount of vmem to be used, so the container is killed by YARN. This can be worked around by disabling YARN's vmem check, but that is a hack. If I run in yarn-client mode it's fine, as long as my container has enough space for the driver, which is manageable. But I feel that the utter lack of Xmx settings for what I believe is a very small JVM is a problem. I believe this was introduced with the fix for SPARK-3884 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
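For reference, the vmem-check workaround mentioned above is a YARN NodeManager setting, not a Spark one; a sketch of the yarn-site.xml entry (it sidesteps the container kill but does not address the missing Xmx):

```xml
<!-- yarn-site.xml: disable YARN's virtual-memory check.
     This is the workaround described above, not a fix. -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```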
[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
[ https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083258#comment-15083258 ] Anson Abraham commented on SPARK-11227: --- I am having this issue as well in my environment, but I'm not running Mesos or YARN; it only occurs with spark-submit. It works with Spark 1.4.x, but with 1.5.x I get the same error when my cluster is in HA mode (but not on YARN or Mesos). I double-checked the configs and they are correct. Any help would be appreciated here. > Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1 > > > Key: SPARK-11227 > URL: https://issues.apache.org/jira/browse/SPARK-11227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1 > Environment: OS: CentOS 6.6 > Memory: 28G > CPU: 8 > Mesos: 0.22.0 > HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager) >Reporter: Yuri Saito > > When running a jar including a Spark job on an HDFS HA cluster with Mesos and > Spark 1.5.1, the job throws the exception "java.net.UnknownHostException: > nameservice1" and fails. > I run the following in a terminal. > {code} > /opt/spark/bin/spark-submit \ > --class com.example.Job /jobs/job-assembly-1.0.0.jar > {code} > Then the job throws the message below.
> {code} > 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 > (TID 0, spark003.example.com): java.lang.IllegalArgumentException: > java.net.UnknownHostException: nameservice1 > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169) > at > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused
[jira] [Commented] (SPARK-12609) Make R to JVM timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083506#comment-15083506 ] Felix Cheung commented on SPARK-12609: -- It looks like the timeout passed to socketConnection is merely the time from connection establishment - we could hardcode it to something arbitrarily long (e.g. 1 day). We should have some way to check whether the JVM backend is alive - for that we could have some kind of keep-alive ping for each request? > Make R to JVM timeout configurable > --- > > Key: SPARK-12609 > URL: https://issues.apache.org/jira/browse/SPARK-12609 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman > > The timeout from R to the JVM is hardcoded at 6000 seconds in > https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22 > This means Spark jobs that take more than 100 minutes always fail. We > should make this timeout configurable through SparkConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work
[ https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083556#comment-15083556 ] Apache Spark commented on SPARK-12521: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/10473 > DataFrame Partitions in java does not work > -- > > Key: SPARK-12521 > URL: https://issues.apache.org/jira/browse/SPARK-12521 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.5.2 >Reporter: Sergey Podolsky > > Hello, > Partitioning does not work in the Java interface of the DataFrame: > {code} > SQLContext sqlContext = new SQLContext(sc); > Map<String, String> options = new HashMap<>(); > options.put("driver", ORACLE_DRIVER); > options.put("url", ORACLE_CONNECTION_URL); > options.put("dbtable", > "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt"); > options.put("lowerBound", "2704225000"); > options.put("upperBound", "2704226000"); > options.put("partitionColumn", "ID"); > options.put("numPartitions", "10"); > DataFrame jdbcDF = sqlContext.load("jdbc", options); > List<Row> jobsRows = jdbcDF.collectAsList(); > System.out.println(jobsRows.size()); > {code} > gives while expected 1000. Is it because the boundaries are big decimals, or > do partitions not work at all in Java? > Thanks. > Sergey -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
[ https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-12654: - Assignee: Thomas Graves > sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop > - > > Key: SPARK-12654 > URL: https://issues.apache.org/jira/browse/SPARK-12654 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > On a secure hadoop cluster using pyspark or spark-shell in yarn client mode > with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute. > Then try to use: > val files = sc.wholeTextFiles("dir") > files.collect() > and it fails with: > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation > Token can be issued only with kerberos or web authentication > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090) > > at org.apache.hadoop.ipc.Client.call(Client.java:1451) > at org.apache.hadoop.ipc.Client.call(Client.java:1382) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434) > at > org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529) > at > org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507) > at > org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242) > at > 
org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55) > at > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12426) Docker JDBC integration tests are failing again
[ https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083482#comment-15083482 ] Mark Grover commented on SPARK-12426: - Sean and Josh, I got to the bottom of this. This happens because Docker does a poor job of surfacing the error that the Docker engine is not running on the machine running the unit tests. The instructions for installing the Docker engine on various OSs are at https://docs.docker.com/engine/installation/ Once installed, the Docker service needs to be started, if it's not already running. On Linux, this is simply {{sudo service docker start}}, and then our docker integration tests pass. Sorry that I didn't get a chance to look into it around the 1.6 RC time; holidays got in the way. I am thinking of adding this info on [this wiki page|https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ]. Please let me know if you think there is a better place; that's the best I could find. I don't seem to have access to edit that page, can one of you please give me access? Also, I was trying to search the code for any puppet recipes we maintain for setting up the build slaves. In other words, if our Jenkins infra were wiped out, how would we make sure docker-engine is installed and running? How do we keep track of build dependencies? Thanks in advance! > Docker JDBC integration tests are failing again > --- > > Key: SPARK-12426 > URL: https://issues.apache.org/jira/browse/SPARK-12426 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 1.6.0 >Reporter: Mark Grover > > The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to > be failing again on my machine (Ubuntu Precise). This was the same box that I > tested my previous commit on. Also, I am not confident this failure has much > to do with Spark, since a well-known commit where the tests were passing, > fails now, in the same environment. 
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing > on his Ubuntu 15 box as well. > Here's the error, fyi: > {code} > 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext > 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting > down remote daemon. > 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote > daemon shut down; proceeding with flushing remote transports. > *** RUN ABORTED *** > com.spotify.docker.client.DockerException: > java.util.concurrent.ExecutionException: > com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: > java.io.IOException: No such file or directory > at > com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141) > at > com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082) > at > com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) > at > org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) > at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) > at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) > ... 
> Cause: java.util.concurrent.ExecutionException: > com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: > java.io.IOException: No such file or directory > at > jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) > at > jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) > at > jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080) > at > com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) > at > org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) > at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) > ... > Cause:
[jira] [Created] (SPARK-12652) Upgrade py4j to the incoming version 0.9.1
Shixiong Zhu created SPARK-12652: Summary: Upgrade py4j to the incoming version 0.9.1 Key: SPARK-12652 URL: https://issues.apache.org/jira/browse/SPARK-12652 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu Upgrade py4j when py4j 0.9.1 is out. Mostly because it fixes two critical issues: SPARK-12511 and SPARK-12617 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12577) better support of parentheses in partition by and order by clause of window function's over clause
[ https://issues.apache.org/jira/browse/SPARK-12577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083590#comment-15083590 ] Thomas Sebastian commented on SPARK-12577: -- Hi Reynold, Would you share some thoughts on how you replicated this issue? - Using sqlContext or the API? - Which version of Spark? - Could you give more detail on the failure message (what sort of exception)? Also, I see a parenthesis-close mismatch in the PASS conditions mentioned. > better support of parentheses in partition by and order by clause of window > function's over clause > -- > > Key: SPARK-12577 > URL: https://issues.apache.org/jira/browse/SPARK-12577 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Right now, Hive's parser supports > {code} > -- PASS > SELECT SUM(1) OVER (PARTITION BY a + 1 - b * c / d FROM src; > SELECT SUM(1) OVER (PARTITION BY (a + 1 - b * c / d) FROM src; > {code} > But the following one is not accepted > {code} > -- FAIL > SELECT SUM(1) OVER (PARTITION BY (a) + 1 - b * c / d) FROM src; > {code} > We should fix it in our own parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12317) Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf
[ https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kevin yu updated SPARK-12317: - Summary: Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf (was: Support configurate value with unit(e.g. kb/mb/gb) in SQL) > Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and > SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf > > > Key: SPARK-12317 > URL: https://issues.apache.org/jira/browse/SPARK-12317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi >Priority: Minor > > e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` > instead of `10485760`, because `10MB` is easier to read than `10485760`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
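The unit-suffixed values the issue asks for can be sketched with a small parser. This is a hypothetical illustration of the requested behavior, not Spark's SQLConf implementation:

```python
import re

# Hypothetical parser for unit-suffixed sizes such as "10MB" or "512kb".
# Not Spark's actual SQLConf code; just an illustration of the requested behavior.
_UNITS = {"b": 1, "kb": 1 << 10, "mb": 1 << 20, "gb": 1 << 30}

def parse_byte_size(value: str) -> int:
    """Parse "10MB" -> 10485760; a bare number is taken as bytes."""
    m = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", value)
    if not m:
        raise ValueError(f"cannot parse size: {value!r}")
    number, unit = int(m.group(1)), (m.group(2).lower() or "b")
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in: {value!r}")
    return number * _UNITS[unit]

print(parse_byte_size("10MB"))  # -> 10485760
```

Accepting a bare number as bytes keeps existing settings such as `10485760` working unchanged while allowing the more readable `10MB` form.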
[jira] [Closed] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-4036. Resolution: Later > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf, dig-hair-eye-train.model, > features.hair-eye, sample-input, sample-output > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12438) Add SQLUserDefinedType support for encoder
[ https://issues.apache.org/jira/browse/SPARK-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12438. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10390 [https://github.com/apache/spark/pull/10390] > Add SQLUserDefinedType support for encoder > -- > > Key: SPARK-12438 > URL: https://issues.apache.org/jira/browse/SPARK-12438 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > We should add SQLUserDefinedType support for encoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12615) Remove some deprecated APIs in RDD/SparkContext
[ https://issues.apache.org/jira/browse/SPARK-12615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12615. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10569 [https://github.com/apache/spark/pull/10569] > Remove some deprecated APIs in RDD/SparkContext > --- > > Key: SPARK-12615 > URL: https://issues.apache.org/jira/browse/SPARK-12615 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-12098. - Resolution: Later > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > Classic cross-validation requires all inner classifiers to run for a fixed > number of iterations, or until convergence. It is costly, especially on > massive data. According to the paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce > the total number of iterations of cross-validation with multi-armed bandit > search. > The multi-armed bandit search for cross-validation (bandit search for short) > requires warm-start of ML algorithms, and fine-grained control of the inner > behavior of the cross validator. > Since there are a bunch of bandit-search algorithms for finding the best > parameter set, we intend to provide only a few of them in the beginning to > reduce the test/perf-test work and make it more stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083631#comment-15083631 ] Joseph K. Bradley commented on SPARK-2344: -- Hi everyone, thanks a lot for your work and discussion about this. However, I think we'll need to postpone this feature because of limited review bandwidth and a need to focus on other items such as language API completeness, etc. Would you be able to post your implementations as Spark packages? For a less common algorithm such as this, it will also be important to collect feedback about how much it improves upon existing MLlib algorithms, so if you get feedback or results from users about your package, please post here. I'll close this JIRA for now but will follow it. Thanks for your understanding. > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Labels: clustering > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K-Means, which is already implemented; they differ > only in the degree of membership each point has in each cluster (in FCM the > membership is in the range [0..1], whereas in K-Means it is 0/1). > As part of the implementation I would like to: > - create a base class for K-Means and FCM > - implement the membership for each algorithm differently (in its class) > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12651) mllib deprecation messages mention non-existent version 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-12651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083474#comment-15083474 ] Sean Owen commented on SPARK-12651: --- I've got this covered in SPARK-12618 / https://github.com/apache/spark/pull/10570 already > mllib deprecation messages mention non-existent version 1.7.0 > - > > Key: SPARK-12651 > URL: https://issues.apache.org/jira/browse/SPARK-12651 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Trivial > > Might be a problem in 1.6 also? > {code} > @Since("1.4.0") > @deprecated("Support for runs is deprecated. This param will have no effect > in 1.7.0.", "1.6.0") > def getRuns: Int = runs > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083604#comment-15083604 ] Joseph K. Bradley commented on SPARK-4036: -- [~hujiayin] Thanks very much for your work on this, but I think we need to discuss this more before putting it into MLlib. The primary reasons are: * We have limited review bandwidth, and we need to focus on non-feature items currently (API improvements and completeness, bugs, etc.). * For a big new feature like this, we would need to do a proper design document and discussion before a PR. CRFs in particular are a very broad field, so it would be important to discuss scope and generality (linear vs general CRFs, applications such as NLP, vision, etc., or even a more general graphical model framework). In the meantime, I'd recommend you create a Spark package based on your work. That will let users take advantage of it, and you can encourage them to post feedback on the package site or here to continue the discussion. I'd like to close this JIRA for now, but I'll continue to watch the discussion on it. > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf, dig-hair-eye-train.model, > features.hair-eye, sample-input, sample-output > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3872) Rewrite the test for ActorInputStream.
[ https://issues.apache.org/jira/browse/SPARK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083632#comment-15083632 ] Josh Rosen commented on SPARK-3872: --- Is this now "Won't Fix" for 2.0? > Rewrite the test for ActorInputStream. > --- > > Key: SPARK-3872 > URL: https://issues.apache.org/jira/browse/SPARK-3872 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Prashant Sharma >Assignee: Prashant Sharma > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12643) Set lib directory for antlr
[ https://issues.apache.org/jira/browse/SPARK-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12643. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.0.0 > Set lib directory for antlr > --- > > Key: SPARK-12643 > URL: https://issues.apache.org/jira/browse/SPARK-12643 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.0.0 > > > Without setting lib directory for antlr, the updates of imported grammar > files can not be detected. So SparkSqlParser.g will not be rebuilt > automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083610#comment-15083610 ] Joseph K. Bradley commented on SPARK-12098: --- [~yinxusen] Thanks for your work on this, but I think we need to delay this feature. It's something we'll probably want to add in the future, but we just don't have the bandwidth right now for it. Could you publish your work as a Spark package for the time being? It would be great if you could get some feedback about the package from users, so that we can get more info about how much it improves on CrossValidator. Thanks for your understanding. > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > Classic cross-validation requires all inner classifiers to run for a fixed > number of iterations, or until convergence. It is costly, especially on > massive data. According to the paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce > the total number of iterations of cross-validation with multi-armed bandit > search. > The multi-armed bandit search for cross-validation (bandit search for short) > requires warm-start of ML algorithms, and fine-grained control of the inner > behavior of the cross validator. > Since there are a bunch of bandit-search algorithms for finding the best > parameter set, we intend to provide only a few of them in the beginning to > reduce the test/perf-test work and make it more stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
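The bandit-style idea in the issue description (spend a small iteration budget on many configurations, then keep training only the promising ones) can be sketched as a successive-halving loop. This is a toy illustration with made-up names, not the proposed Spark API:

```python
# Toy successive-halving sketch of bandit search for hyperparameter tuning.
# All names are made up for illustration; this is not the proposed Spark API.
def successive_halving(configs, train_step, budget_per_round=1, keep_fraction=0.5):
    """configs: dict name -> model state; train_step(state, iters) -> (state, loss).
    Trains every arm a little, drops the worst arms, repeats until one remains."""
    states = dict(configs)
    while len(states) > 1:
        losses = {}
        for name, state in states.items():
            states[name], losses[name] = train_step(state, budget_per_round)
        keep = max(1, int(len(states) * keep_fraction))
        best = sorted(losses, key=losses.get)[:keep]
        states = {name: states[name] for name in best}
    return next(iter(states))

# Fake "warm-startable" training: loss shrinks toward a per-config floor.
def fake_train(state, iters):
    floor, loss = state
    loss = floor + (loss - floor) * 0.5 ** iters
    return (floor, loss), loss

winner = successive_halving({"a": (0.9, 1.0), "b": (0.1, 1.0), "c": (0.5, 1.0)}, fake_train)
print(winner)  # -> "b", the config with the lowest achievable loss
```

The sketch shows why the description calls out warm-start and fine-grained control: the cross validator must be able to resume each arm's training from where it stopped instead of restarting from scratch each round.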
[jira] [Closed] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-2344. Resolution: Later > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Labels: clustering > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K-Means, which is already implemented; they differ > only in the degree of membership each point has in each cluster (in FCM the > membership is in the range [0..1], whereas in K-Means it is 0/1). > As part of the implementation I would like to: > - create a base class for K-Means and FCM > - implement the membership for each algorithm differently (in its class) > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
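The soft assignment that distinguishes FCM from K-Means is the standard membership formula u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)). A minimal sketch for a single 1-D point (illustration only, not MLlib code):

```python
# Minimal fuzzy C-means membership computation for one 1-D point (illustration
# only, not MLlib code). m is the fuzzifier; as m -> 1 the assignment hardens
# toward K-Means' 0/1 membership.
def fcm_memberships(point, centers, m=2.0):
    dists = [abs(point - c) for c in centers]
    if any(d == 0.0 for d in dists):          # point sits exactly on a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

u = fcm_memberships(1.0, centers=[0.0, 4.0])
print(u)  # -> [0.9, 0.1]: memberships sum to 1, the nearer center dominates
```

In K-Means the same point would simply get membership 1 in the nearest cluster and 0 elsewhere, which is the 0/1 contrast the description draws.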
[jira] [Created] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
Thomas Graves created SPARK-12654: - Summary: sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop Key: SPARK-12654 URL: https://issues.apache.org/jira/browse/SPARK-12654 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Thomas Graves On a secure hadoop cluster using pyspark or spark-shell in yarn client mode with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute. Then try to use: val files = sc.wholeTextFiles("dir") files.collect() and it fails with: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090) at org.apache.hadoop.ipc.Client.call(Client.java:1451) at org.apache.hadoop.ipc.Client.call(Client.java:1382) at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029) at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434) at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529) at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507) at org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242) at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55) at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7128) Add generic bagging algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083671#comment-15083671 ] Joseph K. Bradley commented on SPARK-7128: -- But [~fliang] if you have a chance, then it'd be a good Spark package! > Add generic bagging algorithm to spark.ml > - > > Key: SPARK-7128 > URL: https://issues.apache.org/jira/browse/SPARK-7128 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Bagging algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of bagging which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12439) Fix toCatalystArray and MapObjects
[ https://issues.apache.org/jira/browse/SPARK-12439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12439. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10391 [https://github.com/apache/spark/pull/10391] > Fix toCatalystArray and MapObjects > -- > > Key: SPARK-12439 > URL: https://issues.apache.org/jira/browse/SPARK-12439 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > Fix For: 2.0.0 > > > In toCatalystArray, we should look at the data type returned by dataTypeFor > instead of silentSchemaFor, to determine if the element is native type. An > obvious problem is when the element is Option[Int] class, > catalsilentSchemaFor will return Int, then we will wrongly recognize the > element is native type. > There is another problem when using Option as array element. When we encode > data like Seq(Some(1), Some(2), None) with encoder, we will use MapObjects to > construct an array for it later. But in MapObjects, we don't check if the > return value of lambdaFunction is null or not. That causes a bug that the > decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1), instead > of Seq(1, 2, null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12577) better support of parentheses in partition by and order by clause of window function's over clause
[ https://issues.apache.org/jira/browse/SPARK-12577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083590#comment-15083590 ] Thomas Sebastian edited comment on SPARK-12577 at 1/5/16 7:41 PM: -- Hi Reynold, Would you share some thoughts on how you replicated this issue? - which version of Spark? - a bit more detail on the failure message (what sort of exception)? Do you mean that when the sqlContext-based queries (spark-shell) are fired as in the above FAIL conditions, they do not go through, whereas they are accepted via HiveQL? Also, I see a closing-parenthesis mismatch in the PASS conditions mentioned. was (Author: thomastechs): Hi Reynold, Would you share some thoughts on how you replicated this issue? - using sqlContext or the API? - which version of Spark? - a bit more detail on the failure message (what sort of exception)? Also, I see a closing-parenthesis mismatch in the PASS conditions mentioned. > better support of parentheses in partition by and order by clause of window > function's over clause > -- > > Key: SPARK-12577 > URL: https://issues.apache.org/jira/browse/SPARK-12577 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Right now, Hive's parser supports > {code} > -- PASS > SELECT SUM(1) OVER (PARTITION BY a + 1 - b * c / d FROM src; > SELECT SUM(1) OVER (PARTITION BY (a + 1 - b * c / d) FROM src; > {code} > But the following one is not accepted: > {code} > -- FAIL > SELECT SUM(1) OVER (PARTITION BY (a) + 1 - b * c / d) FROM src; > {code} > We should fix it in our own parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8108) Build Hive module by default (i.e. remove -Phive profile)
[ https://issues.apache.org/jira/browse/SPARK-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083645#comment-15083645 ] Josh Rosen commented on SPARK-8108: --- +1 on this change; it'd let us simplify certain build scripts. Would be great if someone could investigate this. Note that we might still want to have a dummy no-op {{-Phive}} profile for compatibility with third-party packaging scripts, but maybe that's not a huge deal. > Build Hive module by default (i.e. remove -Phive profile) > - > > Key: SPARK-8108 > URL: https://issues.apache.org/jira/browse/SPARK-8108 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Reporter: Reynold Xin > > I think this is blocked by a jline conflict between Scala 2.11 and Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
[ https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083643#comment-15083643 ] Thomas Graves commented on SPARK-12654: --- So the bug here is that WholeTextFileRDD.getPartitions has: val conf = getConf. In getConf, if cloneConf=true, it creates a new Hadoop Configuration. Then it uses that to create a new newJobContext. The newJobContext will copy credentials around, but credentials are only present in a JobConf, not in a Hadoop Configuration. So basically, when it clones the Hadoop configuration it changes it from a JobConf to a Configuration and drops the credentials that were there. NewHadoopRDD just uses the conf passed in for getPartitions (not getConf), which is why it works. Need to investigate whether wholeTextFiles should be using conf or whether getConf needs to change. > sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop > - > > Key: SPARK-12654 > URL: https://issues.apache.org/jira/browse/SPARK-12654 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > On a secure hadoop cluster using pyspark or spark-shell in yarn client mode > with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute. > Then try to use: > val files = sc.wholeTextFiles("dir") > files.collect() > and it fails with: > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
> : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation > Token can be issued only with kerberos or web authentication > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090) > > at org.apache.hadoop.ipc.Client.call(Client.java:1451) > at org.apache.hadoop.ipc.Client.call(Client.java:1382) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434) > at > org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529) > at > org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507) > at > org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) > at >
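The failure mode Thomas Graves describes (a JobConf carries credentials, a plain Configuration does not, so a naive clone silently drops the delegation tokens) can be illustrated with a simplified Python analogy. The classes below are stand-ins, not Hadoop's actual API:

```python
# Simplified analogy of the SPARK-12654 bug (stand-in classes, not Hadoop's
# API): JobConf carries credentials on top of Configuration, so copying a
# JobConf into a plain Configuration silently loses the delegation tokens.
class Configuration:
    def __init__(self, other=None):
        self.props = dict(other.props) if other is not None else {}

class JobConf(Configuration):
    def __init__(self, other=None):
        super().__init__(other)
        # Copy credentials only if the source object actually has them.
        self.credentials = dict(getattr(other, "credentials", {}) or {})

job_conf = JobConf()
job_conf.credentials["hdfs-token"] = "delegation-token-bytes"

cloned = Configuration(job_conf)       # what cloneConf=true effectively does
print(hasattr(cloned, "credentials"))  # False: the tokens are gone
```

This mirrors why NewHadoopRDD works (it reuses the original conf, tokens intact) while the cloning path ends up asking the NameNode for a fresh delegation token, which fails outside a kerberos login context.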
[jira] [Updated] (SPARK-12655) GraphX does not unpersist RDDs
[ https://issues.apache.org/jira/browse/SPARK-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-12655: Description: Looks like Graph does not clean all RDDs from the cache on unpersist {code} // open spark-shell 1.5.2 // run import org.apache.spark.graphx._ val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1) val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1) val g0 = Graph(vert, edges) val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2) val cc = g.connectedComponents() cc.unpersist() g.unpersist() g0.unpersist() vert.unpersist() edges.unpersist() {code} open http://localhost:4040/storage/ Spark UI 4040 Storage page still shows 2 items {code} VertexRDD Memory Deserialized 1x Replicated 1 100%1688.0 B0.0 B 0.0 B EdgeRDD Memory Deserialized 1x Replicated 2 100%4.7 KB 0.0 B 0.0 B {code} was: Looks like Graph does not clean all RDDs from the cache on unpersist {code} // open spark-shell 1.5.2 // run import org.apache.spark.graphx._ val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1) val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1) val g0 = Graph(vert, edges) val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2) val cc = g.connectedComponents() cc.unpersist() g.unpersist() g0.unpersist() vert.unpersist() edges.unpersist() // open http://localhost:4040/storage/ // Spark UI 4040 Storage page still shows 2 items // VertexRDDMemory Deserialized 1x Replicated 1 100%1688.0 B0.0 B 0.0 B // EdgeRDD Memory Deserialized 1x Replicated 2 100%4.7 KB 0.0 B 0.0 B {code} > GraphX does not unpersist RDDs > -- > > Key: SPARK-12655 > URL: https://issues.apache.org/jira/browse/SPARK-12655 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2 >Reporter: Alexander Pivovarov > > Looks like Graph does not clean all RDDs from the cache on unpersist > {code} > // open spark-shell 1.5.2 > // run > import 
org.apache.spark.graphx._ > val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1) > val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1) > val g0 = Graph(vert, edges) > val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2) > val cc = g.connectedComponents() > cc.unpersist() > g.unpersist() > g0.unpersist() > vert.unpersist() > edges.unpersist() > {code} > open http://localhost:4040/storage/ > Spark UI 4040 Storage page still shows 2 items > {code} > VertexRDD Memory Deserialized 1x Replicated 1 100%1688.0 > B0.0 B 0.0 B > EdgeRDD Memory Deserialized 1x Replicated 2 100%4.7 KB > 0.0 B 0.0 B > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12655) GraphX does not unpersist RDDs
Alexander Pivovarov created SPARK-12655: --- Summary: GraphX does not unpersist RDDs Key: SPARK-12655 URL: https://issues.apache.org/jira/browse/SPARK-12655 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.5.2 Reporter: Alexander Pivovarov Looks like Graph does not clean all RDDs from the cache on unpersist {code} // open spark-shell 1.5.2 // run import org.apache.spark.graphx._ val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1) val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1) val g0 = Graph(vert, edges) val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2) val cc = g.connectedComponents() cc.unpersist() g.unpersist() g0.unpersist() vert.unpersist() edges.unpersist() // open http://localhost:4040/storage/ // Spark UI 4040 Storage page still shows 2 items // VertexRDDMemory Deserialized 1x Replicated 1 100%1688.0 B0.0 B 0.0 B // EdgeRDD Memory Deserialized 1x Replicated 2 100%4.7 KB 0.0 B 0.0 B {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
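[Editorial note] The report above calls {{unpersist()}} on the graphs themselves, but GraphX operations such as {{partitionBy}} and {{connectedComponents}} cache intermediate vertex and edge RDDs that {{Graph.unpersist}} may not cover. A workaround sketch (untested against this exact Spark version) is to unpersist the {{vertices}} and {{edges}} RDDs of every graph, including derived ones:

```scala
// Workaround sketch: explicitly unpersist the vertex and edge RDDs of each
// graph (derived graphs included), not just the Graph wrappers.
import org.apache.spark.graphx._

val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
val g0 = Graph(vert, edges)
val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)
val cc = g.connectedComponents()

// Unpersist the component RDDs directly.
cc.vertices.unpersist(blocking = false)
cc.edges.unpersist(blocking = false)
g.vertices.unpersist(blocking = false)
g.edges.unpersist(blocking = false)
g0.vertices.unpersist(blocking = false)
g0.edges.unpersist(blocking = false)
vert.unpersist()
edges.unpersist()
```

Whether this clears every entry from the Storage page still depends on which intermediate RDDs the GraphX implementation materializes internally, which is what this issue is about.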
[jira] [Closed] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS
[ https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-11696. - Resolution: Won't Fix [~Narine] I just commented on your PR about this, but I'd like to close this and focus on the spark.ml DataFrame-based API instead. It'd be nice to get your feedback there, on separate JIRAs. Thank you! > MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS > -- > > Key: SPARK-11696 > URL: https://issues.apache.org/jira/browse/SPARK-11696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Narine Kokhlikyan > > Hi there, > in the current implementation the Optimizer.optimize() method returns only the > weights for the features. > However, we could make it more transparent and provide more parameters about > the optimization, e.g. number of iterations, error, etc. > As discussed in the JIRA below, this will be useful: > https://issues.apache.org/jira/browse/SPARK-5575 > What do you think? > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12295) Manage the memory used by window function
[ https://issues.apache.org/jira/browse/SPARK-12295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12295: Assignee: Apache Spark > Manage the memory used by window function > - > > Key: SPARK-12295 > URL: https://issues.apache.org/jira/browse/SPARK-12295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > The buffered rows for a given frame should use UnsafeRow, and be stored as pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12295) Manage the memory used by window function
[ https://issues.apache.org/jira/browse/SPARK-12295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12295: Assignee: (was: Apache Spark) > Manage the memory used by window function > - > > Key: SPARK-12295 > URL: https://issues.apache.org/jira/browse/SPARK-12295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The buffered rows for a given frame should use UnsafeRow, and be stored as pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12295) Manage the memory used by window function
[ https://issues.apache.org/jira/browse/SPARK-12295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083707#comment-15083707 ] Apache Spark commented on SPARK-12295: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/10605 > Manage the memory used by window function > - > > Key: SPARK-12295 > URL: https://issues.apache.org/jira/browse/SPARK-12295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The buffered rows for a given frame should use UnsafeRow, and be stored as pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12662) Add document to randomSplit to explain the sampling depends on the ordering of the rows in a partition
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085007#comment-15085007 ] Reynold Xin commented on SPARK-12662: - Yea [~yhuai] and I talked offline and thought just adding a local sort would be a better solution. It'd make performance worse, but at least guarantee correctness. > Add document to randomSplit to explain the sampling depends on the ordering > of the rows in a partition > -- > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Yin Huai >Assignee: Sameer Agarwal > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will produce overlapping rows in the two DFs returned by randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("test") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")}} does not guarantee > the ordering of rows in a partition. It will be good to add more documentation to the > API doc to explain it. To make intersectDF contain 0 rows, the df needs to > have a fixed row ordering within each partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
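[Editorial note] The local sort that the comment above proposes can be sketched as follows. Sorting within each partition fixes the row order that randomSplit's per-partition sampling depends on, without the cost of a full shuffle. This assumes {{sortWithinPartitions}}, which exists in the DataFrame API from Spark 1.6 onward:

```scala
// Sketch: make randomSplit deterministic by fixing the row order inside each
// partition before sampling. sortWithinPartitions avoids a full shuffle.
val df = sqlContext.sql("select distinct ID from test")
val deterministic = df.sortWithinPartitions("ID")

val Array(a, b) = deterministic.randomSplit(Array(0.333, 0.667), 1234L)

// With a fixed per-partition order, the two splits should be disjoint:
assert(a.intersect(b).count() == 0)
```

The trade-off is exactly the one noted above: an extra per-partition sort on every split, in exchange for reproducible, non-overlapping splits.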
[jira] [Updated] (SPARK-12671) Improve tests for better coverage
[ https://issues.apache.org/jira/browse/SPARK-12671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-12671: --- Affects Version/s: 2.0.0 > Improve tests for better coverage > - > > Key: SPARK-12671 > URL: https://issues.apache.org/jira/browse/SPARK-12671 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > Ideally we want to have 100% test coverage for CSV data source in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12671) Improve tests for better coverage
Hossein Falaki created SPARK-12671: -- Summary: Improve tests for better coverage Key: SPARK-12671 URL: https://issues.apache.org/jira/browse/SPARK-12671 Project: Spark Issue Type: Sub-task Reporter: Hossein Falaki Ideally we want to have 100% test coverage for CSV data source in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7924) Consolidate example code in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085175#comment-15085175 ] Xusen Yin commented on SPARK-7924: -- [~mengxr] One reminder: shouldn't we merge https://issues.apache.org/jira/browse/SPARK-11399 first? The examples I left behind are intertwined with the docs, and I don't think the current "include_example" mechanism supports them well. > Consolidate example code in MLlib > - > > Key: SPARK-7924 > URL: https://issues.apache.org/jira/browse/SPARK-7924 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This JIRA is an umbrella for consolidating example code in MLlib, now that we > are able to insert code snippets from examples into the user guide. This > will contain tasks not already handled by [SPARK-11337]. > Goal: Have all example code in the {{examples/}} folder, and insert code > snippets for examples into the user guide. This will make the example code > easily testable and reduce duplication. > We will have 1 subtask per example. If you would like to help, please either > create a subtask or comment below asking us to create a subtask for you. > For an example to follow, look at: > * > [https://github.com/apache/spark/blob/0171b71e9511cef512e96a759e407207037f3c49/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala] > * TF-IDF example in > [https://raw.githubusercontent.com/apache/spark/0171b71e9511cef512e96a759e407207037f3c49/docs/ml-features.md] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11139) Make SparkContext.stop() exception-safe
[ https://issues.apache.org/jira/browse/SPARK-11139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085172#comment-15085172 ] Felix Cheung commented on SPARK-11139: -- it looks like this might have been resolved by https://github.com/apache/spark/commit/27ae851ce16082775ffbcb5b8fc6bdbe65dc70fc > Make SparkContext.stop() exception-safe > --- > > Key: SPARK-11139 > URL: https://issues.apache.org/jira/browse/SPARK-11139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > > In SparkContext.stop(), when an exception is thrown the rest of the > stop/cleanup action is aborted. > Work has been done in SPARK-4194 to allow for cleanup to partial > initialization. > Similarly issue in StreamingContext SPARK-11137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
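[Editorial note] The exception-safety pattern this issue asks for (and that SPARK-4194-style fixes use) is to run each cleanup step in its own try block, so a failure in one step cannot abort the remaining steps. A minimal sketch, modeled loosely on Spark's internal {{Utils.tryLogNonFatalError}} helper (the method bodies below are placeholders, not Spark's actual shutdown code):

```scala
// Sketch of an exception-safe shutdown: each cleanup action runs in its own
// try block, so one failing step cannot prevent the later steps from running.
def tryLogNonFatalError(block: => Unit): Unit = {
  try {
    block
  } catch {
    case e: Exception => println(s"Ignoring error during stop(): $e")
  }
}

def stop(): Unit = {
  tryLogNonFatalError { println("stopping web UI") }
  tryLogNonFatalError { throw new RuntimeException("scheduler failed") }
  tryLogNonFatalError { println("stopping block manager") } // still runs
  tryLogNonFatalError { println("clearing listeners") }     // still runs
}
```

The key property is that the third and fourth steps execute even though the second one throws, which is exactly the behavior the current SparkContext.stop() lacks according to this report.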
[jira] [Commented] (SPARK-12665) Remove deprecated and unused classes
[ https://issues.apache.org/jira/browse/SPARK-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084999#comment-15084999 ] Apache Spark commented on SPARK-12665: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/10613 > Remove deprecated and unused classes > > > Key: SPARK-12665 > URL: https://issues.apache.org/jira/browse/SPARK-12665 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Kousuke Saruta > > Whole code of Vector.scala and GraphKryoRegistrator are no longer used so > it's time to remove them in Spark 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12665) Remove deprecated and unused classes
[ https://issues.apache.org/jira/browse/SPARK-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12665: Assignee: Apache Spark > Remove deprecated and unused classes > > > Key: SPARK-12665 > URL: https://issues.apache.org/jira/browse/SPARK-12665 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Kousuke Saruta >Assignee: Apache Spark > > Whole code of Vector.scala and GraphKryoRegistrator are no longer used so > it's time to remove them in Spark 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12665) Remove deprecated and unused classes
[ https://issues.apache.org/jira/browse/SPARK-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12665: Assignee: (was: Apache Spark) > Remove deprecated and unused classes > > > Key: SPARK-12665 > URL: https://issues.apache.org/jira/browse/SPARK-12665 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Kousuke Saruta > > Whole code of Vector.scala and GraphKryoRegistrator are no longer used so > it's time to remove them in Spark 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12393) Add read.text and write.text for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12393. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10348 [https://github.com/apache/spark/pull/10348] > Add read.text and write.text for SparkR > --- > > Key: SPARK-12393 > URL: https://issues.apache.org/jira/browse/SPARK-12393 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yanbo Liang > Fix For: 2.0.0, 1.6.1 > > > Add read.text and write.text for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12670) Use spark internal utilities wherever possible
Hossein Falaki created SPARK-12670: -- Summary: Use spark internal utilities wherever possible Key: SPARK-12670 URL: https://issues.apache.org/jira/browse/SPARK-12670 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Hossein Falaki The initial code from spark-csv does not rely on Spark's internal utilities to maintain backward compatibility across multiple versions of Spark. * Type casting utilities * Schema inference utilities * Unit test utilities -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12668) Renaming CSV options to be similar to Pandas and R
Hossein Falaki created SPARK-12668: -- Summary: Renaming CSV options to be similar to Pandas and R Key: SPARK-12668 URL: https://issues.apache.org/jira/browse/SPARK-12668 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Hossein Falaki Fix For: 2.0.0 Renaming options to be similar to Pandas and R * Alias for delimiter -> sep * charset -> encoding * codec -> compression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
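[Editorial note] With the renamings proposed above, CSV reads and writes would look roughly like the sketch below. The option names ({{sep}}, {{encoding}}, {{compression}}) are the proposals in this ticket, not necessarily what the final PR merges:

```scala
// Sketch using the proposed Pandas/R-style option names. The exact names
// depend on what gets merged; "sep", "encoding", and "compression" are the
// renamings proposed in this issue.
val df = sqlContext.read
  .format("csv")
  .option("sep", "\t")          // proposed alias for "delimiter"
  .option("encoding", "UTF-8")  // formerly "charset"
  .option("header", "true")
  .load("events.tsv")

df.write
  .format("csv")
  .option("compression", "gzip") // formerly "codec"
  .save("events-out")
```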
[jira] [Created] (SPARK-12669) Organize options for default values
Hossein Falaki created SPARK-12669: -- Summary: Organize options for default values Key: SPARK-12669 URL: https://issues.apache.org/jira/browse/SPARK-12669 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Hossein Falaki CSV data source in SparkSQL should be able to differentiate empty string, null, NaN, “N/A” (maybe data type dependent). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
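[Editorial note] One way the differentiation described above could surface in the API is via per-source options. The option names below ({{nullValue}}, {{nanValue}}) are illustrative only; the ticket does not specify names:

```scala
// Hypothetical sketch: distinguishing empty strings, nulls, and NaN when
// reading CSV. Option names are illustrative, not a committed API.
val df = sqlContext.read
  .format("csv")
  .option("nullValue", "N/A") // parse "N/A" cells as SQL NULL
  .option("nanValue", "NaN")  // parse "NaN" as Double.NaN in numeric columns
  .load("data.csv")
```

The "maybe data type dependent" caveat in the description suggests the defaults could differ per column type, e.g. an empty string staying {{""}} for string columns but becoming NULL for numeric ones.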
[jira] [Updated] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12666: --- Description: Symptom: I cloned the latest master of {{spark-redshift}}, then used {{sbt publishLocal}} to publish it to my Ivy cache. When I tried running {{./bin/spark-shell --packages com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency into {{spark-shell}}, I received the following cryptic error: {code} Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It was required from org.apache.spark#spark-submit-parent;1.0 default] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} I think the problem here is that Spark is declaring a dependency on the spark-redshift artifact using the {{default}} Ivy configuration. Based on my admittedly limited understanding of Ivy, the default configuration will be the only configuration defined in an Ivy artifact if that artifact defines no other configurations. Thus, for Maven artifacts I think the default configuration will end up mapping to Maven's regular JAR dependency (i.e. Maven artifacts don't declare Ivy configurations so they implicitly have the {{default}} configuration) but for Ivy artifacts I think we can run into trouble when loading artifacts which explicitly define their own configurations, since those artifacts might not have a configuration named {{default}}. 
I spent a bit of time playing around with the SparkSubmit code to see if I could fix this but wasn't able to completely resolve the issue. /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, if you'd like) was: Symptom: I cloned the latest master of {{spark-redshift}}, then used {{sbt publishLocal}} to publish it to my Ivy cache. When I tried running {{./bin/spark-shell --packages com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency into {{spark-shell}}, I received the following cryptic error: {code} Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It was required from org.apache.spark#spark-submit-parent;1.0 default] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} I think the problem here is that Spark is declaring a dependency on the spark-redshift artifact using the {{default}} Ivy configuration. Based on my admittedly limited understanding of Ivy, the default configuration will be the only configuration defined in an Ivy artifact if that artifact defines no other configurations. Thus, for Maven artifacts I think the default configuration will end up mapping to Maven's regular JAR dependency but for Ivy artifacts I think we can run into trouble when loading artifacts which explicitly define their own configurations, since those artifacts might not have a configuration named {{default}}. 
I spent a bit of time playing around with the SparkSubmit code to see if I could fix this but wasn't able to completely resolve the issue. /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, if you'd like) > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main"
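[Editorial note] The Ivy behavior described above can be made concrete with a small illustrative {{ivy.xml}} (this is a made-up module descriptor, not the actual spark-redshift one). When a module declares its own configurations and none is named {{default}}, a resolver that requests conf {{default}} fails with exactly the "configuration not found ... 'default'" error shown in the report:

```xml
<!-- Illustrative ivy.xml: this module declares "compile" and "test"
     configurations but none named "default", so a dependent module that
     requests conf="default" fails with "configuration not found: 'default'". -->
<ivy-module version="2.0">
  <info organisation="com.databricks" module="spark-redshift_2.10"
        revision="0.5.3-SNAPSHOT"/>
  <configurations>
    <conf name="compile" visibility="public"/>
    <conf name="test" visibility="private" extends="compile"/>
    <!-- adding <conf name="default" extends="compile"/> would satisfy the
         resolver, which is why plain Maven POMs (which implicitly expose a
         "default" configuration) resolve fine. -->
  </configurations>
</ivy-module>
```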
[jira] [Commented] (SPARK-12433) Make parts of the core Spark testing API public to assist developers making their own tests.
[ https://issues.apache.org/jira/browse/SPARK-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085050#comment-15085050 ] Apache Spark commented on SPARK-12433: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/10614 > Make parts of the core Spark testing API public to assist developers making > their own tests. > > > Key: SPARK-12433 > URL: https://issues.apache.org/jira/browse/SPARK-12433 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: holdenk >Priority: Trivial > > See parent JIRA for proposed API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085075#comment-15085075 ] Xusen Yin commented on SPARK-12098: --- [~josephkb] I understand that. I'll try to publish it as a Spark package first. > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > The classic cross-validation requires all inner classifiers iterate to a > fixed number of iterations, or until convergence states. It is costly > especially in the massive data scenario. According to the paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce > the amount of total iterations of cross-validation with multi-armed bandit > search. > The multi-armed bandit search for cross-validation (bandit search for short) > requires warm-start of ml algorithms, and fine-grained control of the inner > behavior of the corss validator. > Since there are bunch of algorithms of bandit search to find the best > parameter set, we intent to provide only a few of them in the beginning to > reduce the test/perf-test work and make it more stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12422) Binding Spark Standalone Master to public IP fails
[ https://issues.apache.org/jira/browse/SPARK-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084913#comment-15084913 ] Tommy Yu commented on SPARK-12422: -- Hi, for Docker images, please check the /etc/hosts file and remove the first line that maps the container's IP to its hostname. If you want to set up a cluster environment based on Docker, I suggest taking a look at the doc below. sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html Regards. > Binding Spark Standalone Master to public IP fails > -- > > Key: SPARK-12422 > URL: https://issues.apache.org/jira/browse/SPARK-12422 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.5.2 > Environment: Fails on direct deployment on Mac OSX and also in Docker > Environment (running on OSX or Ubuntu) >Reporter: Bennet Jeutter >Priority: Blocker > > The start of the Spark Standalone Master fails when the host specified > equals the public IP address. For example I created a Docker Machine with > public IP 192.168.99.100, then I run: > /usr/spark/bin/spark-class org.apache.spark.deploy.master.Master -h > 192.168.99.100 > It'll fail with: > Exception in thread "main" java.net.BindException: Failed to bind to: > /192.168.99.100:7093: Service 'sparkMaster' failed after 16 retries! 
> at > org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) > at > akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393) > at > akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389) > at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Success.map(Try.scala:206) > at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) > at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > at > akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > So I thought oh well, lets just bind to the local IP and access it via public > IP - this doesn't work, it will give: > dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://sparkMaster@192.168.99.100:7077/]] arriving at > 
[akka.tcp://sparkMaster@192.168.99.100:7077] inbound addresses are > [akka.tcp://sparkMaster@spark-master:7077] > So there is currently no possibility to run all this... related stackoverflow > issues: > * > http://stackoverflow.com/questions/31659228/getting-java-net-bindexception-when-attempting-to-start-spark-master-on-ec2-node > * > http://stackoverflow.com/questions/33768029/access-apache-spark-standalone-master-via-ip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
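[Editorial note] Two workarounds commonly suggested for this class of bind failure are sketched below. Both are untested against this exact setup; the env var names are from the Spark 1.x standalone-mode docs:

```shell
# Workaround sketch 1: bind the master to all interfaces so it is reachable
# on the public IP regardless of which address the container can bind.
/usr/spark/bin/spark-class org.apache.spark.deploy.master.Master -h 0.0.0.0

# Workaround sketch 2: keep the bind address local but make the master
# advertise the public IP, by exporting these before start-master.sh.
export SPARK_MASTER_IP=192.168.99.100   # address workers/clients connect to
export SPARK_LOCAL_IP=192.168.99.100    # address the JVM tries to bind
/usr/spark/sbin/start-master.sh
```

The second variant only works if the public IP is actually assigned to an interface inside the container, which is why the /etc/hosts fix in the comment above is often the real cure in Docker environments.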
[jira] [Commented] (SPARK-3873) Scala style: check import ordering
[ https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084613#comment-15084613 ] Apache Spark commented on SPARK-3873: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/10612 > Scala style: check import ordering > -- > > Key: SPARK-3873 > URL: https://issues.apache.org/jira/browse/SPARK-3873 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12563) "No suitable driver" when calling JdbcUtils.saveTable in isolation
[ https://issues.apache.org/jira/browse/SPARK-12563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084963#comment-15084963 ] Sonya Huang commented on SPARK-12563: - Thanks [~jayadevan.m]. It looks like right after you posted this, someone made a commit to fix another issue (SPARK-12579) which might make this obsolete. I also realized that DataFrameWriter.jdbc actually calls JdbcUtils.saveTable so it isn't a redundant function as I thought when I first encountered this. > "No suitable driver" when calling JdbcUtils.saveTable in isolation > -- > > Key: SPARK-12563 > URL: https://issues.apache.org/jira/browse/SPARK-12563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Sonya Huang >Priority: Minor > > When calling the following function > JdbcUtils.saveTable(df, url, table, properties) > the following exception is thrown. > Exception in thread "main" java.sql.SQLException: No suitable driver > at java.sql.DriverManager.getDriver(DriverManager.java:315) > at > org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.getDriverClassName(DriverRegistry.scala:55) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:212) > at com.pul.sive.TestThingy$$anonfun$main$2.apply(TestThingy.scala:77) > at com.pul.sive.TestThingy$$anonfun$main$2.apply(TestThingy.scala:69) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > at com.pul.sive.TestThingy$.main(TestThingy.scala:69) > at com.pul.sive.TestThingy.main(TestThingy.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > However, the above works if the following is called directly before: > JdbcUtils.createConnection(url, properties) > It appears that JdbcUtils.saveTable attempts to get the driver from > DriverRegistry before reading the contents of the properties argument. > Jdbc.createConnection adds the driver to DriverRegistry as a side effect, so > this lookup works. > However it also appears that DataFrame.write.jdbc(url, table, properties) > accomplishes the same thing with more flexibility, so I am not sure if > JdbcUtils.saveTable is redundant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
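[Editorial note] Besides the {{createConnection}} side-effect trick described above, the lookup can be satisfied by naming the JDBC driver class explicitly in the connection properties, so DriverRegistry can register it before {{saveTable}} asks DriverManager for it. A sketch (the PostgreSQL driver class is an example; substitute your own, and note the "driver" property key is the one documented for the Spark SQL JDBC source):

```scala
// Workaround sketch: pass the driver class name in the "driver" property so
// the lookup in JdbcUtils.saveTable / DataFrameWriter.jdbc can resolve it.
import java.util.Properties

val df = sqlContext.range(10).toDF("id") // any DataFrame to persist

val properties = new Properties()
properties.setProperty("user", "me")
properties.setProperty("password", "secret")
properties.setProperty("driver", "org.postgresql.Driver") // example driver

df.write.jdbc("jdbc:postgresql://db-host:5432/mydb", "my_table", properties)
```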
[jira] [Commented] (SPARK-11607) Update MLlib website for 1.6
[ https://issues.apache.org/jira/browse/SPARK-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085045#comment-15085045 ] Xiangrui Meng commented on SPARK-11607: --- Listed bisecting k-means and accelerated failure time model on MLlib page. > Update MLlib website for 1.6 > > > Key: SPARK-11607 > URL: https://issues.apache.org/jira/browse/SPARK-11607 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update MLlib's website to include features in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12667) Remove block manager's internal "external block store" API
Reynold Xin created SPARK-12667: --- Summary: Remove block manager's internal "external block store" API Key: SPARK-12667 URL: https://issues.apache.org/jira/browse/SPARK-12667 Project: Spark Issue Type: Sub-task Components: Block Manager, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Assigned] (SPARK-12420) Have a built-in CSV data source implementation
[ https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12420: Assignee: Apache Spark > Have a built-in CSV data source implementation > -- > > Key: SPARK-12420 > URL: https://issues.apache.org/jira/browse/SPARK-12420 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Attachments: Built-in CSV datasource in Spark.pdf > > > CSV is the most common data format in the "small data" world. It is often the > first format people want to try when they see Spark on a single node. Having > to rely on a 3rd party component for this is a very bad user experience for > new users. > We should consider inlining https://github.com/databricks/spark-csv
[jira] [Assigned] (SPARK-12420) Have a built-in CSV data source implementation
[ https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12420: Assignee: (was: Apache Spark) > Have a built-in CSV data source implementation > -- > > Key: SPARK-12420 > URL: https://issues.apache.org/jira/browse/SPARK-12420 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: Built-in CSV datasource in Spark.pdf > > > CSV is the most common data format in the "small data" world. It is often the > first format people want to try when they see Spark on a single node. Having > to rely on a 3rd party component for this is a very bad user experience for > new users. > We should consider inlining https://github.com/databricks/spark-csv
[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation
[ https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085097#comment-15085097 ] Apache Spark commented on SPARK-12420: -- User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/10615 > Have a built-in CSV data source implementation > -- > > Key: SPARK-12420 > URL: https://issues.apache.org/jira/browse/SPARK-12420 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: Built-in CSV datasource in Spark.pdf > > > CSV is the most common data format in the "small data" world. It is often the > first format people want to try when they see Spark on a single node. Having > to rely on a 3rd party component for this is a very bad user experience for > new users. > We should consider inlining https://github.com/databricks/spark-csv
[jira] [Commented] (SPARK-11139) Make SparkContext.stop() exception-safe
[ https://issues.apache.org/jira/browse/SPARK-11139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085135#comment-15085135 ] Jayadevan M commented on SPARK-11139: - [~felixcheung] Is this issue resolved? If yes, can you tell the version? > Make SparkContext.stop() exception-safe > --- > > Key: SPARK-11139 > URL: https://issues.apache.org/jira/browse/SPARK-11139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > > In SparkContext.stop(), when an exception is thrown, the rest of the > stop/cleanup actions are aborted. > Work has been done in SPARK-4194 to allow cleanup of partial > initialization. > Similar issue in StreamingContext: SPARK-11137
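The exception-safety requested in SPARK-11139 is a general shutdown pattern: run each cleanup step in its own try/except so that one failing step cannot abort the steps after it. A minimal illustrative sketch (not SparkContext's actual code; all names here are hypothetical):

```python
# Sketch of the exception-safe shutdown pattern the issue asks for.
# Each cleanup step is isolated in its own try/except, so a failure in
# one step is logged and the remaining steps still run.

def stop(cleanup_steps, log):
    for name, step in cleanup_steps:
        try:
            step()
        except Exception as e:  # swallow and record; keep cleaning up
            log.append(f"error stopping {name}: {e}")

stopped = []
log = []

def fail():
    raise RuntimeError("boom")

steps = [
    ("ui", lambda: stopped.append("ui")),
    ("metrics", fail),  # this step throws...
    ("scheduler", lambda: stopped.append("scheduler")),  # ...but this still runs
]
stop(steps, log)
```

With this structure, the "metrics" failure is recorded but "scheduler" is still stopped, which is exactly the behavior an abort-on-first-exception `stop()` lacks.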
[jira] [Updated] (SPARK-12454) Add ExpressionDescription to expressions that are registered in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-12454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12454: - Summary: Add ExpressionDescription to expressions that are registered in FunctionRegistry (was: Add ExpressionDescription to expressions are registered in FunctionRegistry) > Add ExpressionDescription to expressions that are registered in > FunctionRegistry > > > Key: SPARK-12454 > URL: https://issues.apache.org/jira/browse/SPARK-12454 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai > > ExpressionDescription is an annotation that contains the doc of a function; > when users run {{describe function}}, they can see the doc defined in this > annotation. You can take a look at {{Upper}} as an example. > However, we still have lots of expressions that do not have > ExpressionDescription. It will be great to take a look at expressions > registered in FunctionRegistry and add ExpressionDescription to those > expressions that do not have it. > A list of expressions (and their categories) registered in the function registry > can be found at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L117-L296.
[jira] [Assigned] (SPARK-12433) Make parts of the core Spark testing API public to assist developers making their own tests.
[ https://issues.apache.org/jira/browse/SPARK-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12433: Assignee: Apache Spark > Make parts of the core Spark testing API public to assist developers making > their own tests. > > > Key: SPARK-12433 > URL: https://issues.apache.org/jira/browse/SPARK-12433 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > See parent JIRA for proposed API
[jira] [Assigned] (SPARK-12433) Make parts of the core Spark testing API public to assist developers making their own tests.
[ https://issues.apache.org/jira/browse/SPARK-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12433: Assignee: (was: Apache Spark) > Make parts of the core Spark testing API public to assist developers making > their own tests. > > > Key: SPARK-12433 > URL: https://issues.apache.org/jira/browse/SPARK-12433 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: holdenk >Priority: Trivial > > See parent JIRA for proposed API
[jira] [Closed] (SPARK-12433) Make parts of the core Spark testing API public to assist developers making their own tests.
[ https://issues.apache.org/jira/browse/SPARK-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-12433. --- Resolution: Won't Fix Sorry I don't think we should have this. This is bad for API evolution and dependency management (Spark runtime shouldn't depend on a specific version of scalatest). > Make parts of the core Spark testing API public to assist developers making > their own tests. > > > Key: SPARK-12433 > URL: https://issues.apache.org/jira/browse/SPARK-12433 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: holdenk >Priority: Trivial > > See parent JIRA for proposed API
[jira] [Assigned] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12436: Assignee: (was: Apache Spark) > If all values of a JSON field is null, JSON's inferSchema should return > NullType instead of StringType > -- > > Key: SPARK-12436 > URL: https://issues.apache.org/jira/browse/SPARK-12436 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > Labels: starter > > Right now, JSON's inferSchema will return {{StringType}} for a field that > always has null values or an {{ArrayType(StringType)}} for a field that > always has empty array values. Although this behavior makes writing JSON data > to other data sources easy (i.e. when writing data, we do not need to remove > those {{NullType}} or {{ArrayType(NullType)}} columns), it makes downstream > applications hard to reason about the actual schema of the data and thus makes > schema merging hard. We should allow JSON's inferSchema to return {{NullType}} > and {{ArrayType(NullType)}}. Also, we need to make sure that when we write > data out, we should remove those {{NullType}} or {{ArrayType(NullType)}} > columns first. > Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same > thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). > To finish this work, we need to finish the following sub-tasks: > * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. > * Determine whether we need to add the operation of removing {{NullType}} and > {{ArrayType(NullType)}} columns from the data that will be written out for all > data sources (i.e. data sources based on our data source API and Hive tables). > Or, we should just add this operation for certain data sources (e.g. > Parquet). For example, we may not need this operation for Hive because Hive > has VoidObjectInspector. > * Implement the change and get it merged to Spark master.
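The two behaviors proposed above can be sketched with a toy schema inferencer. This is an illustrative model only (the helper names and the string type tags are hypothetical, not Spark's API): a field whose values are all null infers as NullType, and NullType columns are dropped before writing data out.

```python
# Toy sketch of the SPARK-12436 proposal (hypothetical helpers, not
# Spark's actual inferSchema): an all-null field infers as "NullType",
# and NullType columns are removed before data is written out.

def infer_field_type(values):
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "NullType"  # proposed: NullType instead of StringType
    if all(isinstance(v, str) for v in non_null):
        return "StringType"
    return "UnknownType"

def infer_schema(records):
    # Union of field names across records, each typed from its values.
    fields = {k for r in records for k in r}
    return {f: infer_field_type([r.get(f) for r in records])
            for f in sorted(fields)}

def drop_null_type_columns(records, schema):
    # The "remove NullType columns before writing out" step.
    keep = {f for f, t in schema.items() if t != "NullType"}
    return [{k: v for k, v in r.items() if k in keep} for r in records]

records = [{"a": "x", "b": None}, {"a": "y", "b": None}]
schema = infer_schema(records)
cleaned = drop_null_type_columns(records, schema)
```

Here field "b" is null in every record, so it infers as NullType and is stripped before the write, while "a" survives as StringType, matching the issue's intent.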
[jira] [Assigned] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12436: Assignee: Apache Spark > If all values of a JSON field is null, JSON's inferSchema should return > NullType instead of StringType > -- > > Key: SPARK-12436 > URL: https://issues.apache.org/jira/browse/SPARK-12436 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > > Right now, JSON's inferSchema will return {{StringType}} for a field that > always has null values or an {{ArrayType(StringType)}} for a field that > always has empty array values. Although this behavior makes writing JSON data > to other data sources easy (i.e. when writing data, we do not need to remove > those {{NullType}} or {{ArrayType(NullType)}} columns), it makes downstream > applications hard to reason about the actual schema of the data and thus makes > schema merging hard. We should allow JSON's inferSchema to return {{NullType}} > and {{ArrayType(NullType)}}. Also, we need to make sure that when we write > data out, we should remove those {{NullType}} or {{ArrayType(NullType)}} > columns first. > Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same > thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). > To finish this work, we need to finish the following sub-tasks: > * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. > * Determine whether we need to add the operation of removing {{NullType}} and > {{ArrayType(NullType)}} columns from the data that will be written out for all > data sources (i.e. data sources based on our data source API and Hive tables). > Or, we should just add this operation for certain data sources (e.g. > Parquet). For example, we may not need this operation for Hive because Hive > has VoidObjectInspector. > * Implement the change and get it merged to Spark master.