[jira] [Assigned] (SPARK-13277) ANTLR ignores other rule using the USING keyword

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13277:


Assignee: (was: Apache Spark)

> ANTLR ignores other rule using the USING keyword
> 
>
> Key: SPARK-13277
> URL: https://issues.apache.org/jira/browse/SPARK-13277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Priority: Minor
>
> ANTLR currently emits the following warning during compilation:
> {noformat}
> warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: 
> Decision can match input such as "KW_USING Identifier" using multiple 
> alternatives: 2, 3
> As a result, alternative(s) 3 were disabled for that input
> {noformat}
> This means that some of the functionality of the parser is disabled. This was 
> introduced by the migration of the DDLParsers 
> (https://github.com/apache/spark/pull/10723). We should be able to fix this 
> by introducing a syntactic predicate for USING.
> cc [~viirya]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13277) ANTLR ignores other rule using the USING keyword

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142376#comment-15142376
 ] 

Apache Spark commented on SPARK-13277:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11168

> ANTLR ignores other rule using the USING keyword
> 
>
> Key: SPARK-13277
> URL: https://issues.apache.org/jira/browse/SPARK-13277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Priority: Minor
>
> ANTLR currently emits the following warning during compilation:
> {noformat}
> warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: 
> Decision can match input such as "KW_USING Identifier" using multiple 
> alternatives: 2, 3
> As a result, alternative(s) 3 were disabled for that input
> {noformat}
> This means that some of the functionality of the parser is disabled. This was 
> introduced by the migration of the DDLParsers 
> (https://github.com/apache/spark/pull/10723). We should be able to fix this 
> by introducing a syntactic predicate for USING.
> cc [~viirya]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13277) ANTLR ignores other rule using the USING keyword

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13277:


Assignee: Apache Spark

> ANTLR ignores other rule using the USING keyword
> 
>
> Key: SPARK-13277
> URL: https://issues.apache.org/jira/browse/SPARK-13277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> ANTLR currently emits the following warning during compilation:
> {noformat}
> warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: 
> Decision can match input such as "KW_USING Identifier" using multiple 
> alternatives: 2, 3
> As a result, alternative(s) 3 were disabled for that input
> {noformat}
> This means that some of the functionality of the parser is disabled. This was 
> introduced by the migration of the DDLParsers 
> (https://github.com/apache/spark/pull/10723). We should be able to fix this 
> by introducing a syntactic predicate for USING.
> cc [~viirya]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13270) Improve readability of whole stage codegen by skipping empty lines and outputting the pipeline plan

2016-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13270.
-
   Resolution: Fixed
 Assignee: Nong Li
Fix Version/s: 2.0.0

> Improve readability of whole stage codegen by skipping empty lines and 
> outputting the pipeline plan
> ---
>
> Key: SPARK-13270
> URL: https://issues.apache.org/jira/browse/SPARK-13270
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> It would be nice to comment the generated function with the pipeline it is 
> for, particularly for complex queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13235) Remove an Extra Distinct in Union

2016-02-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-13235.
---
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> Remove an Extra Distinct in Union
> -
>
> Key: SPARK-13235
> URL: https://issues.apache.org/jira/browse/SPARK-13235
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Union Distinct has two Distinct operators that generate two Aggregations in the plan.
> {code}
> sql("select * from t0 union select * from t0").explain(true)
> {code}
> {code}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias(*,None)]
> +- 'Subquery u_2
>+- 'Distinct
>   +- 'Project [unresolvedalias(*,None)]
>  +- 'Subquery u_1
> +- 'Distinct
>+- 'Union
>   :- 'Project [unresolvedalias(*,None)]
>   :  +- 'UnresolvedRelation `t0`, None
>   +- 'Project [unresolvedalias(*,None)]
>  +- 'UnresolvedRelation `t0`, None
> == Analyzed Logical Plan ==
> id: bigint
> Project [id#16L]
> +- Subquery u_2
>+- Distinct
>   +- Project [id#16L]
>  +- Subquery u_1
> +- Distinct
>+- Union
>   :- Project [id#16L]
>   :  +- Subquery t0
>   : +- Relation[id#16L] ParquetRelation
>   +- Project [id#16L]
>  +- Subquery t0
> +- Relation[id#16L] ParquetRelation
> == Optimized Logical Plan ==
> Aggregate [id#16L], [id#16L]
> +- Aggregate [id#16L], [id#16L]
>+- Union
>   :- Project [id#16L]
>   :  +- Relation[id#16L] ParquetRelation
>   +- Project [id#16L]
>  +- Relation[id#16L] ParquetRelation
> {code}
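
For illustration, a minimal sketch of the kind of Catalyst rule that could collapse the redundant operator; the object name is hypothetical and this is not necessarily the rule used in the actual fix:

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative sketch only: collapse two directly nested Distinct operators
// into one before they are rewritten into Aggregates, so the extra Distinct
// introduced by UNION parsing results in a single aggregation.
object CollapseAdjacentDistinct extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Distinct(Distinct(child)) => Distinct(child)
  }
}
{code}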



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13281) Switch broadcast of RDD to exception from warning

2016-02-10 Thread holdenk (JIRA)
holdenk created SPARK-13281:
---

 Summary: Switch broadcast of RDD to exception from warning
 Key: SPARK-13281
 URL: https://issues.apache.org/jira/browse/SPARK-13281
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk
Priority: Trivial


As noted in the comments, we log a warning when a user tries to broadcast an 
RDD, for compatibility with old programs which may have broadcast RDDs without 
using the resulting broadcast variable. Since we're moving to 2.0, it seems like 
now would be a good opportunity to replace that warning with an exception rather 
than depend on the developer finding the warning message.
Related to https://issues.apache.org/jira/browse/SPARK-5063 
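
A minimal sketch of the proposed behaviour, written as a hypothetical standalone helper rather than the actual SparkContext change:

{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical helper, not the actual SparkContext code path: fail fast with an
// exception when the value being broadcast is an RDD, instead of only logging
// a warning that is easy to miss.
def checkedBroadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T] = {
  if (value.isInstanceOf[RDD[_]]) {
    throw new IllegalArgumentException(
      "Cannot broadcast an RDD; collect() its elements and broadcast those instead")
  }
  sc.broadcast(value)
}
{code}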



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13276) Parse Table Identifiers/Expression skips bad characters at the end of the passed string

2016-02-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-13276.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.0.0

> Parse Table Identifiers/Expression skips bad characters at the end of the 
> passed string
> ---
>
> Key: SPARK-13276
> URL: https://issues.apache.org/jira/browse/SPARK-13276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.0
>
>
> Both the ParseDriver.parseTableName/parseExpression methods currently allow 
> the passed command to end with any kind of (bad) characters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13234) Remove duplicated SQL metrics

2016-02-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13234.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11163
[https://github.com/apache/spark/pull/11163]

> Remove duplicated SQL metrics
> -
>
> Key: SPARK-13234
> URL: https://issues.apache.org/jira/browse/SPARK-13234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> For lots of SQL operators we have metrics for both input and output, but the 
> number of input rows should be exactly the number of output rows of the child, 
> so we could keep metrics for output rows only.
> After we improve performance using whole-stage codegen, the overhead of SQL 
> metrics is no longer trivial, so we should avoid it when it's not necessary.
> Some operators do not have SQL metrics at all; we should add them.
> For operators that pass the same number of rows from input to output (for 
> example, Projection), we may not need an input-row metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13061) Error in spark rest api application info for job names contains spaces

2016-02-10 Thread Avihoo Mamka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142318#comment-15142318
 ] 

Avihoo Mamka commented on SPARK-13061:
--

I'm executing this request using Python requests
{code:xml}
http://spark.mysite.com:8088/ws/v1/cluster/apps/?state=RUNNING
{code}

Then I convert my result to json and get this result:

{code:xml}
{u'apps': {u'app': [{u'runningContainers': 61, u'allocatedVCores': 61, 
u'clusterId': 1448371222831, u'amContainerLogs': 
u'http://ip-10-20-1-246:8042/node/containerlogs/container_1448371222831_2557_01_01/hadoop',
 u'id': u'application_1448371222831_2557', u'preemptedResourceMB': 0, 
u'finishedTime': 0, u'numAMContainerPreempted': 0, u'user': u'hadoop', 
u'preemptedResourceVCores': 0, u'startedTime': 1455170737855, u'elapsedTime': 
1769164, u'state': u'RUNNING', u'numNonAMContainerPreempted': 0, u'progress': 
10.0, u'trackingUI': u'ApplicationMaster', u'trackingUrl': 
u'http://ip-10-20-1-104:20888/proxy/application_1448371222831_2557/', 
u'allocatedMB': 553984, u'amHostHttpAddress': u'ip-10-20-1-246:8042', 
u'memorySeconds': 1520893, u'applicationTags': u'', u'name': u'Spark shell', 
u'queue': u'default', u'vcoreSeconds': 134, u'applicationType': u'SPARK', 
u'diagnostics': u'', u'finalStatus': u'UNDEFINED'}]}}
{code}

I then extract the value of key {code:xml}apps{code} and extract the value of 
key {code:xml}app{code} inside. So right now I have this json array:
{code:xml}
[{u'runningContainers': 61, u'allocatedVCores': 61, u'clusterId': 
1448371222831, u'amContainerLogs': 
u'http://ip-10-20-1-246:8042/node/containerlogs/container_1448371222831_2557_01_01/hadoop',
 u'id': u'application_1448371222831_2557', u'preemptedResourceMB': 0, 
u'finishedTime': 0, u'numAMContainerPreempted': 0, u'user': u'hadoop', 
u'preemptedResourceVCores': 0, u'startedTime': 1455170737855, u'elapsedTime': 
1769164, u'state': u'RUNNING', u'numNonAMContainerPreempted': 0, u'progress': 
10.0, u'trackingUI': u'ApplicationMaster', u'trackingUrl': 
u'http://ip-10-20-1-104:20888/proxy/application_1448371222831_2557/', 
u'allocatedMB': 553984, u'amHostHttpAddress': u'ip-10-20-1-246:8042', 
u'memorySeconds': 1520893, u'applicationTags': u'', u'name': u'Spark shell', 
u'queue': u'default', u'vcoreSeconds': 134, u'applicationType': u'SPARK', 
u'diagnostics': u'', u'finalStatus': u'UNDEFINED'}]
{code}

I then loop over the array and, for each item, extract the 
{code:xml}id{code} and {code:xml}name{code}.
In the above example it will be: id -> application_1448371222831_2557 and name 
-> Spark shell

Now I execute this rest call:
{code:xml}
http://spark.mysite.com:20888/proxy/application_1448371222831_2557/api/v1/applications/Spark
 shell/jobs/0
{code}

And then I get this result:
{code:xml}
Spark shell Not Found
{code}

> Error in spark rest api application info for job names contains spaces
> --
>
> Key: SPARK-13061
> URL: https://issues.apache.org/jira/browse/SPARK-13061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Avihoo Mamka
>Priority: Trivial
>  Labels: rest_api, spark
>
> When accessing the Spark REST API with an application id to get job-specific 
> status, a job name containing whitespace is encoded to '%20' and therefore 
> the REST API returns `no such app`.
> For example:
> http://spark.mysite.com:20888/proxy/application_1447676402999_1254/api/v1/applications/
>  returns:
> [ {
>   "id" : "Spark shell",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-01-28T09:20:58.526GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "sparkUser" : "",
> "completed" : false
>   } ]
> } ]
> and then when accessing:
> http://spark.mysite.com:20888/proxy/application_1447676402999_1254/api/v1/applications/Spark
>  shell/
> the result returned is:
> unknown app: Spark%20shell
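
For illustration, a small sketch of the encoding mismatch described above (not Spark code; the values are taken from this report):

{code}
import java.net.{URLDecoder, URLEncoder}

// Illustrative sketch of the mismatch: the application name is percent-encoded
// when it travels in the URL path, so a lookup keyed on the raw encoded segment
// can never match an application named "Spark shell".
val name    = "Spark shell"
val encoded = URLEncoder.encode(name, "UTF-8").replace("+", "%20") // "Spark%20shell"
val decoded = URLDecoder.decode(encoded, "UTF-8")                  // "Spark shell"

// Comparing `encoded` against the stored name fails and yields
// "unknown app: Spark%20shell"; comparing `decoded` would succeed.
{code}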



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13260) count(*) does not work with CSV data source

2016-02-10 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142295#comment-15142295
 ] 

Thomas Sebastian commented on SPARK-13260:
--

Hi [~falaki], could you give more details about this issue? Do you see any 
count difference?

> count(*) does not work with CSV data source
> ---
>
> Key: SPARK-13260
> URL: https://issues.apache.org/jira/browse/SPARK-13260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> Column pruning in the CSV data source seems to omit all columns when we run 
> the following query:
> {code}
> select count(*) from csvTable
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13279:


Assignee: (was: Apache Spark)

> Spark driver stuck holding a global lock when there are 200k tasks submitted 
> in a stage
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
> Fix For: 1.6.0
>
>
> While running a large pipeline with 200k tasks, we found that the executors 
> were not able to register with the driver because the driver was stuck 
> holding a global lock in TaskSchedulerImpl.submitTasks function. 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding a pending task. So when we have 200k pending tasks, because of 
> this O(n^2) operation, the driver just hangs for more than 5 minutes. 
> Solution - In the addPendingTask function, we don't really need a duplicate 
> check. It's okay if we add a task to the same queue twice, because 
> dequeueTaskFromList will skip already-running tasks. 
> Please note that this is a regression from Spark 1.5.
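
A minimal sketch of the proposed simplification, as a hypothetical standalone function rather than the actual TaskSetManager code:

{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: dropping the linear duplicate scan makes each insert
// O(1); dequeueTaskFromList already skips tasks that are no longer pending,
// so duplicates in the list are harmless.
def addPendingTask(pending: ArrayBuffer[Int], taskIndex: Int): Unit = {
  // Before: if (!pending.contains(taskIndex)) pending += taskIndex  // O(n) per insert
  pending += taskIndex
}
{code}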



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13279:


Assignee: Apache Spark

> Spark driver stuck holding a global lock when there are 200k tasks submitted 
> in a stage
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
> Fix For: 1.6.0
>
>
> While running a large pipeline with 200k tasks, we found that the executors 
> were not able to register with the driver because the driver was stuck 
> holding a global lock in TaskSchedulerImpl.submitTasks function. 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding a pending task. So when we have 200k pending tasks, because of 
> this O(n^2) operation, the driver just hangs for more than 5 minutes. 
> Solution - In the addPendingTask function, we don't really need a duplicate 
> check. It's okay if we add a task to the same queue twice, because 
> dequeueTaskFromList will skip already-running tasks. 
> Please note that this is a regression from Spark 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142286#comment-15142286
 ] 

Apache Spark commented on SPARK-13279:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/11167

> Spark driver stuck holding a global lock when there are 200k tasks submitted 
> in a stage
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
> Fix For: 1.6.0
>
>
> While running a large pipeline with 200k tasks, we found that the executors 
> were not able to register with the driver because the driver was stuck 
> holding a global lock in TaskSchedulerImpl.submitTasks function. 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding a pending task. So when we have 200k pending tasks, because of 
> this O(n^2) operation, the driver just hangs for more than 5 minutes. 
> Solution - In the addPendingTask function, we don't really need a duplicate 
> check. It's okay if we add a task to the same queue twice, because 
> dequeueTaskFromList will skip already-running tasks. 
> Please note that this is a regression from Spark 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-13279:

Description: 
While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - In the addPendingTask function, we don't really need a duplicate 
check. It's okay if we add a task to the same queue twice, because 
dequeueTaskFromList will skip already-running tasks. 

Please note that this is a regression from Spark 1.5.

  was:
While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - In the addPendingTask function, we don't really need a duplicate 
check. It's okay if we add a task to the same queue twice, because 
dequeueTaskFromList will skip already-running tasks.


> Spark driver stuck holding a global lock when there are 200k tasks submitted 
> in a stage
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
> Fix For: 1.6.0
>
>
> While running a large pipeline with 200k tasks, we found that the executors 
> were not able to register with the driver because the driver was stuck 
> holding a global lock in TaskSchedulerImpl.submitTasks function. 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding a pending task. So when we have 200k pending tasks, because of 
> this O(n^2) operation, the driver just hangs for more than 5 minutes. 
> Solution - In the addPendingTask function, we don't really need a duplicate 
> check. It's okay if we add a task to the same queue twice, because 
> dequeueTaskFromList will skip already-running tasks. 
> Please note that this is a regression from Spark 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-13279:

Description: 
While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - In the addPendingTask function, we don't really need a duplicate 
check. It's okay if we add a task to the same queue twice, because 
dequeueTaskFromList will skip already-running tasks.

  was:
While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which will 
give us O(1) lookup and also maintain the ordering. 



> Spark driver stuck holding a global lock when there are 200k tasks submitted 
> in a stage
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
> Fix For: 1.6.0
>
>
> While running a large pipeline with 200k tasks, we found that the executors 
> were not able to register with the driver because the driver was stuck 
> holding a global lock in TaskSchedulerImpl.submitTasks function. 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding a pending task. So when we have 200k pending tasks, because of 
> this O(n^2) operation, the driver just hangs for more than 5 minutes. 
> Solution - In the addPendingTask function, we don't really need a duplicate 
> check. It's okay if we add a task to the same queue twice, because 
> dequeueTaskFromList will skip already-running tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13268) SQL Timestamp stored as GMT but toString returns GMT-08:00

2016-02-10 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142267#comment-15142267
 ] 

Jayadevan M commented on SPARK-13268:
-

@Ilya Ganelin
Can you tell how you import ZonedDateTime in your program, and which Java version you are using?

> SQL Timestamp stored as GMT but toString returns GMT-08:00
> --
>
> Key: SPARK-13268
> URL: https://issues.apache.org/jira/browse/SPARK-13268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ilya Ganelin
>
> There is an issue with how timestamps are displayed/converted to Strings in 
> Spark SQL. The documentation states that the timestamp should be created in 
> the GMT time zone; however, if we do so, we see that the output actually 
> contains a -8 hour offset:
> {code}
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
> res144: java.sql.Timestamp = 2014-12-31 16:00:00.0
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
> res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
> {code}
> This result is confusing, unintuitive, and introduces issues when converting 
> from DataFrames containing timestamps to RDDs which are then saved as text. 
> This has the effect of essentially shifting all dates in a dataset by 1 day. 
> The suggested fix for this is to update the timestamp toString representation 
> to either a) Include timezone or b) Correctly display in GMT.
> This change may well introduce substantial and insidious bugs so I'm not sure 
> how best to resolve this.
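
For illustration, a small sketch of the underlying behaviour, assuming a JVM default time zone of GMT-08:00 as in the report: the Timestamp holds the correct epoch instant, and only toString renders it in the default zone.

{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

// Illustrative sketch: the Timestamp stores the epoch instant for
// 2015-01-01T00:00:00Z, but toString renders it in the JVM default zone.
val ts = new Timestamp(1420070400000L) // 2015-01-01T00:00:00Z in epoch millis

val gmtFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
gmtFormat.setTimeZone(TimeZone.getTimeZone("GMT"))

println(ts)                    // "2014-12-31 16:00:00.0" on a GMT-08:00 machine
println(gmtFormat.format(ts))  // "2015-01-01 00:00:00" regardless of the default zone
{code}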



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12706) support grouping/grouping_id function together group set

2016-02-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12706.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10677
[https://github.com/apache/spark/pull/10677]

> support grouping/grouping_id function together group set
> 
>
> Key: SPARK-12706
> URL: https://issues.apache.org/jira/browse/SPARK-12706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction
> http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13205) SQL generation support for self join

2016-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-13205:
---
Assignee: Xiao Li

> SQL generation support for self join
> 
>
> Key: SPARK-13205
> URL: https://issues.apache.org/jira/browse/SPARK-13205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> SQL generation does not support the Self Join.
> {code}SELECT x.key FROM t1 x JOIN t1 y ON x.key = y.key{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13205) SQL generation support for self join

2016-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-13205.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11084
[https://github.com/apache/spark/pull/11084]

> SQL generation support for self join
> 
>
> Key: SPARK-13205
> URL: https://issues.apache.org/jira/browse/SPARK-13205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> SQL generation does not support the Self Join.
> {code}SELECT x.key FROM t1 x JOIN t1 y ON x.key = y.key{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13280) FileBasedWriteAheadLog logger name should be under o.a.s namespace

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13280:


Assignee: (was: Apache Spark)

> FileBasedWriteAheadLog logger name should be under o.a.s namespace
> --
>
> Key: SPARK-13280
> URL: https://issues.apache.org/jira/browse/SPARK-13280
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The logger name in FileBasedWriteAheadLog is currently defined as:
> {code}
>   override protected val logName = s"WriteAheadLogManager $callerNameTag"
> {code}
> That has two problems:
> - It's not under the usual "org.apache.spark" namespace so changing the 
> logging configuration for that package does not affect it
> - we've seen cases where {{$callerNameTag}} was empty, in which case the 
> logger name would have a trailing space, making it impossible to disable it 
> using a properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13280) FileBasedWriteAheadLog logger name should be under o.a.s namespace

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142179#comment-15142179
 ] 

Apache Spark commented on SPARK-13280:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11165

> FileBasedWriteAheadLog logger name should be under o.a.s namespace
> --
>
> Key: SPARK-13280
> URL: https://issues.apache.org/jira/browse/SPARK-13280
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The logger name in FileBasedWriteAheadLog is currently defined as:
> {code}
>   override protected val logName = s"WriteAheadLogManager $callerNameTag"
> {code}
> That has two problems:
> - It's not under the usual "org.apache.spark" namespace so changing the 
> logging configuration for that package does not affect it
> - we've seen cases where {{$callerNameTag}} was empty, in which case the 
> logger name would have a trailing space, making it impossible to disable it 
> using a properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13280) FileBasedWriteAheadLog logger name should be under o.a.s namespace

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13280:


Assignee: Apache Spark

> FileBasedWriteAheadLog logger name should be under o.a.s namespace
> --
>
> Key: SPARK-13280
> URL: https://issues.apache.org/jira/browse/SPARK-13280
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> The logger name in FileBasedWriteAheadLog is currently defined as:
> {code}
>   override protected val logName = s"WriteAheadLogManager $callerNameTag"
> {code}
> That has two problems:
> - It's not under the usual "org.apache.spark" namespace so changing the 
> logging configuration for that package does not affect it
> - we've seen cases where {{$callerNameTag}} was empty, in which case the 
> logger name would have a trailing space, making it impossible to disable it 
> using a properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13280) FileBasedWriteAheadLog logger name should be under o.a.s namespace

2016-02-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13280:
--

 Summary: FileBasedWriteAheadLog logger name should be under o.a.s 
namespace
 Key: SPARK-13280
 URL: https://issues.apache.org/jira/browse/SPARK-13280
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


The logger name in FileBasedWriteAheadLog is currently defined as:

{code}
  override protected val logName = s"WriteAheadLogManager $callerNameTag"
{code}

That has two problems:

- It's not under the usual "org.apache.spark" namespace so changing the logging 
configuration for that package does not affect it
- we've seen cases where {{$callerNameTag}} was empty, in which case the logger 
name would have a trailing space, making it impossible to disable it using a 
properties file.
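
One possible direction, sketched as a hypothetical helper function (not necessarily the change in the eventual patch): derive the name from the class's own package and trim the caller tag.

{code}
// Hypothetical sketch: keep the logger name under the org.apache.spark
// namespace and avoid a trailing space when the caller tag is empty.
def writeAheadLogName(callerNameTag: String): String =
  s"org.apache.spark.streaming.util.FileBasedWriteAheadLog $callerNameTag".trim

// writeAheadLogName("")       -> "org.apache.spark.streaming.util.FileBasedWriteAheadLog"
// writeAheadLogName("recv-1") -> "org.apache.spark.streaming.util.FileBasedWriteAheadLog recv-1"
{code}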





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules

2016-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12725.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11050
[https://github.com/apache/spark/pull/11050]

> SQL generation suffers from name conflicts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
> COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) 
> AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
> aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>   +- Subquery t
>  +- Project [id#46L AS a#47L,id#46L AS b#48L]
> +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
> :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
> different expression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-02-10 Thread Sven Krasser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142120#comment-15142120
 ] 

Sven Krasser commented on SPARK-12675:
--

I'm running into the same issue (same exception) running locally using Spark 
1.6.0. There are just 200 partitions in my case. My job fails in local mode, 
but it eventually completes in local\[*\] mode (but in either case the 
exception occurs during processing).

[~josephkb], any suggestions on where to go from here -- reopen? The 
{{ClassCastException}} certainly looks suspicious.

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the driver doesn't receive heartbeats 
> from the executor and it times out, then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>   at 
> org.apache.spark.executor.Ex

[jira] [Updated] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13274:

Fix Version/s: 1.6.1

> Fix Aggregator Links on GroupedDataset Scala API 
> -
>
> Key: SPARK-13274
> URL: https://issues.apache.org/jira/browse/SPARK-13274
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Raela Wang
>Assignee: Raela Wang
>Priority: Trivial
> Fix For: 1.6.1, 2.0.0
>
>
> Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
> mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13274.
-
   Resolution: Fixed
 Assignee: Raela Wang
Fix Version/s: 2.0.0

> Fix Aggregator Links on GroupedDataset Scala API 
> -
>
> Key: SPARK-13274
> URL: https://issues.apache.org/jira/browse/SPARK-13274
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Raela Wang
>Assignee: Raela Wang
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
> mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-13279:

Description: 
While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which will 
give us O(1) lookup and also maintain the ordering. 


  was:
While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which will 
give us O(1) lookup and also maintain the ordering. 



> Spark driver stuck holding a global lock when there are 200k tasks submitted 
> in a stage
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
> Fix For: 1.6.0
>
>
> While running a large pipeline with 200k tasks, we found that the executors 
> were not able to register with the driver because the driver was stuck 
> holding a global lock in TaskSchedulerImpl.submitTasks function. 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding a pending task. So when we have 200k pending tasks, because of 
> this O(n^2) operation, the driver just hangs for more than 5 minutes. 
> Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which will 
> give us O(1) lookup and also maintain the ordering. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13279) Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage

2016-02-10 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-13279:
---

 Summary: Spark driver stuck holding a global lock when there are 
200k tasks submitted in a stage
 Key: SPARK-13279
 URL: https://issues.apache.org/jira/browse/SPARK-13279
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Sital Kedia
 Fix For: 1.6.0


While running a large pipeline with 200k tasks, we found that the executors 
were not able to register with the driver because the driver was stuck holding 
a global lock in TaskSchedulerImpl.submitTasks function. 

jstack of the driver - http://pastebin.com/m8CP6VMv

executor log - http://pastebin.com/2NPS1mXC


From the jstack I see that the thread handling the resource offer from 
executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
"dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when 
adding a pending task. So when we have 200k pending tasks, because of this 
O(n^2) operation, the driver just hangs for more than 5 minutes. 

Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which will 
give us O(1) lookup and also maintain the ordering. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-10 Thread Lin Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141998#comment-15141998
 ] 

Lin Zhao commented on SPARK-13069:
--

There also seems to be no way to specify a bounded blocking mailbox for 
ActorReceiverSupervisor, which would solve this. The only way I can think of is 
adding a storeSync to ActorReceiver.

> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate at which an actor receiver sends data to Spark is not limited by 
> maxRate or back pressure. Spark would control how fast it writes the data to 
> the block manager, but the receiver actor sends events asynchronously and 
> would fill up the Akka mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-10 Thread Lin Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141992#comment-15141992
 ] 

Lin Zhao commented on SPARK-13069:
--

I haven't tested with the master code, but looking at the source it almost 
certainly has the same issue.

> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate at which an actor receiver sends data to Spark is not limited by 
> maxRate or back pressure. Spark would control how fast it writes the data to 
> the block manager, but the receiver actor sends events asynchronously and 
> would fill up the Akka mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13234) Remove duplicated SQL metrics

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141930#comment-15141930
 ] 

Apache Spark commented on SPARK-13234:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11163

> Remove duplicated SQL metrics
> -
>
> Key: SPARK-13234
> URL: https://issues.apache.org/jira/browse/SPARK-13234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> For lots of SQL operators we have metrics for both input and output, but the 
> number of input rows should be exactly the number of output rows of the child, 
> so we could keep metrics for output rows only.
> After we improve performance using whole-stage codegen, the overhead of SQL 
> metrics is no longer trivial, so we should avoid it when it's not necessary.
> Some operators do not have SQL metrics at all; we should add them.
> For operators that pass the same number of rows from input to output (for 
> example, Projection), we may not need an input-row metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13234) Remove duplicated SQL metrics

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13234:


Assignee: Apache Spark

> Remove duplicated SQL metrics
> -
>
> Key: SPARK-13234
> URL: https://issues.apache.org/jira/browse/SPARK-13234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> For lots of SQL operators we have metrics for both input and output, but the 
> number of input rows should be exactly the number of output rows of the child, 
> so we could keep metrics for output rows only.
> After we improve performance using whole-stage codegen, the overhead of SQL 
> metrics is no longer trivial, so we should avoid it when it's not necessary.
> Some operators do not have SQL metrics at all; we should add them.
> For operators that pass the same number of rows from input to output (for 
> example, Projection), we may not need an input-row metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13234) Remove duplicated SQL metrics

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13234:


Assignee: (was: Apache Spark)

> Remove duplicated SQL metrics
> -
>
> Key: SPARK-13234
> URL: https://issues.apache.org/jira/browse/SPARK-13234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> For many SQL operators we have metrics for both input and output, but the 
> number of input rows is exactly the number of output rows of the child, so we 
> could keep metrics only for output rows.
> After we improve performance using whole-stage codegen, the overhead of 
> SQL metrics is not trivial anymore, so we should avoid it if it's not 
> necessary.
> Some operators do not have SQL metrics at all; we should add them.
> For those operators that have the same number of rows for input and output 
> (for example, Projection), we may not need that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13149) Add FileStreamSource

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141917#comment-15141917
 ] 

Apache Spark commented on SPARK-13149:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11162

> Add FileStreamSource
> 
>
> Key: SPARK-13149
> URL: https://issues.apache.org/jira/browse/SPARK-13149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13262) cannot coerce type 'environment' to vector of type 'list'

2016-02-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141877#comment-15141877
 ] 

Shivaram Venkataraman commented on SPARK-13262:
---

Can you paste the code you ran that led to the error? It would be great if there 
is a small reproducible example.

> cannot coerce type 'environment' to vector of type 'list'
> -
>
> Key: SPARK-13262
> URL: https://issues.apache.org/jira/browse/SPARK-13262
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Samuel Alexander
>
> Occasionally getting the following error while constructing a dataframe in 
> SparkR:
> 16/02/09 13:28:06 WARN RBackendHandler: cannot find matching method class 
> org.apache.spark.sql.api.r.SQLUtils.dfToCols. Candidates are:
> Error in as.vector(x, "list") : 
>   cannot coerce type 'environment' to vector of type 'list'
> Restarting SparkR fixed the error.
> What is the cause for this issue? How can we solve it? 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141868#comment-15141868
 ] 

Apache Spark commented on SPARK-13278:
--

User 'cl4es' has created a pull request for this issue:
https://github.com/apache/spark/pull/11160

> Launcher fails to start with JDK 9 EA
> -
>
> Key: SPARK-13278
> URL: https://issues.apache.org/jira/browse/SPARK-13278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Claes Redestad
>
> CommandBuilderUtils.addPermGenSizeOpt needs to handle the JDK 9 version string 
> format, which can look like the expected 9, but also like 9-ea and 9+100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13278:


Assignee: (was: Apache Spark)

> Launcher fails to start with JDK 9 EA
> -
>
> Key: SPARK-13278
> URL: https://issues.apache.org/jira/browse/SPARK-13278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Claes Redestad
>
> CommandBuilderUtils.addPermGenSizeOpt needs to handle the JDK 9 version string 
> format, which can look like the expected 9, but also like 9-ea and 9+100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13278:


Assignee: Apache Spark

> Launcher fails to start with JDK 9 EA
> -
>
> Key: SPARK-13278
> URL: https://issues.apache.org/jira/browse/SPARK-13278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Claes Redestad
>Assignee: Apache Spark
>
> CommandBuilderUtils.addPermGenSizeOpt needs to handle the JDK 9 version string 
> format, which can look like the expected 9, but also like 9-ea and 9+100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9438) restarting leader zookeeper causes spark master to die when the spark master election is assigned to zookeeper

2016-02-10 Thread Thomas Demoor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141835#comment-15141835
 ] 

Thomas Demoor commented on SPARK-9438:
--

We have witnessed this as well in 1.3. Losing the ZK leader takes down the 
active Spark master.



> restarting leader zookeeper causes spark master to die when the spark master 
> election is assigned to zookeeper
> --
>
> Key: SPARK-9438
> URL: https://issues.apache.org/jira/browse/SPARK-9438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Spark 1.2.0 and ZooKeeper version: 3.4.6-1569965
>Reporter: Amir Rad
>
> When Spark master election is assigned to ZooKeeper, restarting the leader 
> ZooKeeper causes the Spark master to die.
> Steps to reproduce:
> create a cluster of 3 spark nodes. 
> set Spark-env to:
> SPARK_LOCAL_DIRS="/home/sparkcde/data_spark/data"
> SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
> SPARK_WORKER_DIR="/home/sparkcde/data_spark/worker"
> SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=s1:2181,s2:2181,s3:2181"
> Identify the Spark master
> Identify the ZooKeeper leader
> Stop the ZooKeeper leader
> Check the Spark master: it is dead
> Start the ZooKeeper leader
> Check the Spark master: still dead
> If you continue the same pattern of stopping and starting the ZooKeeper leader, 
> eventually you will lose the whole Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13278) Launcher fails to start with JDK 9 EA

2016-02-10 Thread Claes Redestad (JIRA)
Claes Redestad created SPARK-13278:
--

 Summary: Launcher fails to start with JDK 9 EA
 Key: SPARK-13278
 URL: https://issues.apache.org/jira/browse/SPARK-13278
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Claes Redestad


CommandBuilderUtils.addPermGenSizeOpt needs to handle the JDK 9 version string 
format, which can look like the expected 9, but also like 9-ea and 9+100.
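A hedged sketch (not the actual Spark patch) of the kind of parsing this calls for; it reduces strings like "1.8.0_66", "9", "9-ea" and "9+100" to a plain major version number:

{code}
object JavaVersion {
  // Illustrative only: strip "-ea" / "+100" style suffixes, then read the major version,
  // treating pre-9 strings of the form "1.x.y" as major version x.
  def major(version: String): Int = {
    val base = version.split("[-+]")(0)
    val parts = base.split("\\.")
    val first = parts(0).toInt
    if (first == 1 && parts.length > 1) parts(1).takeWhile(_.isDigit).toInt else first
  }

  def main(args: Array[String]): Unit =
    Seq("1.8.0_66", "9", "9-ea", "9+100").foreach(v => println(s"$v -> ${major(v)}"))
}
{code}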



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13056) Map column would throw NPE if value is null

2016-02-10 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-13056:
-
Assignee: Adrian Wang

> Map column would throw NPE if value is null
> ---
>
> Key: SPARK-13056
> URL: https://issues.apache.org/jira/browse/SPARK-13056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
> Fix For: 1.6.1, 2.0.0
>
>
> Create a map like
> { "a": "somestring",
>   "b": null}
> Query like
> SELECT col["b"] FROM t1;
> NPE would be thrown.
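A hypothetical repro sketch (object and app names are illustrative, not from the ticket) that builds a table with a map-typed column containing a null value and then looks that key up with SQL:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

object MapNullRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapNullRepro").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // One row whose map column maps "b" to null, mirroring the description above.
    val schema = StructType(Seq(StructField("col", MapType(StringType, StringType))))
    val rows = sc.parallelize(Seq(Row(Map("a" -> "somestring", "b" -> null))))
    sqlContext.createDataFrame(rows, schema).registerTempTable("t1")

    // Before the fix this lookup could throw a NullPointerException instead of returning null.
    sqlContext.sql("""SELECT col["b"] FROM t1""").show()
    sc.stop()
  }
}
{code}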



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13056) Map column would throw NPE if value is null

2016-02-10 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-13056.
--
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1

[~marmbrus] [~adrian-wang] Looks like the JIRA wasn't updated when this was 
merged; I'm doing it manually now -- please update if I've made a mistake.

Issue Resolved by pull request 10964
https://github.com/apache/spark/pull/10964

> Map column would throw NPE if value is null
> ---
>
> Key: SPARK-13056
> URL: https://issues.apache.org/jira/browse/SPARK-13056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
> Fix For: 1.6.1, 2.0.0
>
>
> Create a map like
> { "a": "somestring",
>   "b": null}
> Query like
> SELECT col["b"] FROM t1;
> NPE would be thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2016-02-10 Thread Ignacio tartavull (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141761#comment-15141761
 ] 

Ignacio tartavull commented on SPARK-3789:
--

Is there any update on the status of this ticket?

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13277) ANTLR ignores other rule using the USING keyword

2016-02-10 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-13277:
-

 Summary: ANTLR ignores other rule using the USING keyword
 Key: SPARK-13277
 URL: https://issues.apache.org/jira/browse/SPARK-13277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Herman van Hovell
Priority: Minor


ANTLR currently emits the following warning during compilation:
{noformat}
warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: 
Decision can match input such as "KW_USING Identifier" using multiple 
alternatives: 2, 3

As a result, alternative(s) 3 were disabled for that input
{noformat}

This means that some of the functionality of the parser is disabled. This is 
introduced by the migration of the DDLParsers 
(https://github.com/apache/spark/pull/10723). We should be able to fix this by 
introducing a syntactic predicate for USING.

cc [~viirya]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13276) Parse Table Identifiers/Expression skips bad characters at the end of the passed string

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13276:


Assignee: (was: Apache Spark)

> Parse Table Identifiers/Expression skips bad characters at the end of the 
> passed string
> ---
>
> Key: SPARK-13276
> URL: https://issues.apache.org/jira/browse/SPARK-13276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Priority: Minor
>
> Both the ParseDriver.parseTableName/parseExpression methods currently allow 
> the passed command to end with any kind of (bad) characters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13276) Parse Table Identifiers/Expression skips bad characters at the end of the passed string

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13276:


Assignee: Apache Spark

> Parse Table Identifiers/Expression skips bad characters at the end of the 
> passed string
> ---
>
> Key: SPARK-13276
> URL: https://issues.apache.org/jira/browse/SPARK-13276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> Both the ParseDriver.parseTableName/parseExpression methods currently allow 
> the passed command to end with any kind of (bad) characters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13276) Parse Table Identifiers/Expression skips bad characters at the end of the passed string

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141733#comment-15141733
 ] 

Apache Spark commented on SPARK-13276:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/11159

> Parse Table Identifiers/Expression skips bad characters at the end of the 
> passed string
> ---
>
> Key: SPARK-13276
> URL: https://issues.apache.org/jira/browse/SPARK-13276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Priority: Minor
>
> Both the ParseDriver.parseTableName/parseExpression methods currently allow 
> the passed command to end with any kind of (bad) characters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13057) Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation

2016-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13057.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.0.0

> Add benchmark codes and the performance results for implemented compression 
> schemes for InMemoryRelation
> 
>
> Key: SPARK-13057
> URL: https://issues.apache.org/jira/browse/SPARK-13057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.0
>
>
> This ticket adds benchmark code for in-memory cache compression to make 
> future development and discussion smoother.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12414) Remove closure serializer

2016-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12414.
-
   Resolution: Fixed
 Assignee: Sean Owen  (was: Andrew Or)
Fix Version/s: 2.0.0

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Sean Owen
> Fix For: 2.0.0
>
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the Java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13276) Parse Table Identifiers/Expression skips bad characters at the end of the passed string

2016-02-10 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-13276:
-

 Summary: Parse Table Identifiers/Expression skips bad characters 
at the end of the passed string
 Key: SPARK-13276
 URL: https://issues.apache.org/jira/browse/SPARK-13276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Herman van Hovell
Priority: Minor


Both the ParseDriver.parseTableName/parseExpression methods currently allow the 
passed command to end with any kind of (bad) characters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141656#comment-15141656
 ] 

Apache Spark commented on SPARK-13274:
--

User 'raelawang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11158

> Fix Aggregator Links on GroupedDataset Scala API 
> -
>
> Key: SPARK-13274
> URL: https://issues.apache.org/jira/browse/SPARK-13274
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Raela Wang
>Priority: Trivial
>
> Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
> mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141653#comment-15141653
 ] 

Apache Spark commented on SPARK-11714:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/11157

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null

2016-02-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141619#comment-15141619
 ] 

Shixiong Zhu commented on SPARK-13266:
--

Could you submit a PR?

> Python DataFrameReader converts None to "None" instead of null
> --
>
> Key: SPARK-13266
> URL: https://issues.apache.org/jira/browse/SPARK-13266
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Linux standalone but probably applies to all
>Reporter: mathieu longtin
>  Labels: easyfix, patch
>
> If you do something like this:
> {code:none}
> tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
> tsv_loader.options(quote=None, escape=None)
> {code}
> The loader sees the string "None" as the _quote_ and _escape_ options. The 
> loader should get a _null_.
> An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top and 
> correct the _to_str_ function. Here's the patch:
> {code:none}
> diff --git a/python/pyspark/sql/readwriter.py 
> b/python/pyspark/sql/readwriter.py
> index a3d7eca..ba18d13 100644
> --- a/python/pyspark/sql/readwriter.py
> +++ b/python/pyspark/sql/readwriter.py
> @@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]
>  def to_str(value):
>  """
> -A wrapper over str(), but convert bool values to lower case string
> +A wrapper over str(), but convert bool values to lower case string, and 
> keep None
>  """
>  if isinstance(value, bool):
>  return str(value).lower()
> +elif value is None:
> +return value
>  else:
>  return str(value)
> {code}
> This has been tested and works great.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-13271:


Assignee: Shixiong Zhu

> Better error message if 'path' is not specified
> ---
>
> Key: SPARK-13271
> URL: https://issues.apache.org/jira/browse/SPARK-13271
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> As per discussion in 
> https://github.com/apache/spark/pull/11034#discussion_r52111238
> we should improve the error message:
> {code}
> scala> sqlContext.read.format("text").load()
> java.util.NoSuchElementException: key not found: path
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13271:
-
Component/s: SQL

> Better error message if 'path' is not specified
> ---
>
> Key: SPARK-13271
> URL: https://issues.apache.org/jira/browse/SPARK-13271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>
> As per discussion in 
> https://github.com/apache/spark/pull/11034#discussion_r52111238
> we should improve the error message:
> {code}
> scala> sqlContext.read.format("text").load()
> java.util.NoSuchElementException: key not found: path
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13271:
-
Issue Type: Improvement  (was: Bug)

> Better error message if 'path' is not specified
> ---
>
> Key: SPARK-13271
> URL: https://issues.apache.org/jira/browse/SPARK-13271
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>
> As per discussion in 
> https://github.com/apache/spark/pull/11034#discussion_r52111238
> we should improve the error message:
> {code}
> scala> sqlContext.read.format("text").load()
> java.util.NoSuchElementException: key not found: path
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141607#comment-15141607
 ] 

Apache Spark commented on SPARK-13274:
--

User 'raelawang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11156

> Fix Aggregator Links on GroupedDataset Scala API 
> -
>
> Key: SPARK-13274
> URL: https://issues.apache.org/jira/browse/SPARK-13274
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Raela Wang
>Priority: Trivial
>
> Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
> mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13274:


Assignee: Apache Spark

> Fix Aggregator Links on GroupedDataset Scala API 
> -
>
> Key: SPARK-13274
> URL: https://issues.apache.org/jira/browse/SPARK-13274
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Raela Wang
>Assignee: Apache Spark
>Priority: Trivial
>
> Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
> mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13274:


Assignee: (was: Apache Spark)

> Fix Aggregator Links on GroupedDataset Scala API 
> -
>
> Key: SPARK-13274
> URL: https://issues.apache.org/jira/browse/SPARK-13274
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Raela Wang
>Priority: Trivial
>
> Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
> mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-02-10 Thread Stephanie Bodoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephanie Bodoff updated SPARK-13275:
-
Attachment: webui.png

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Priority: Minor
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered, it looks as if the executors started before the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-02-10 Thread Stephanie Bodoff (JIRA)
Stephanie Bodoff created SPARK-13275:


 Summary: With dynamic allocation, executors appear to be added 
before job starts
 Key: SPARK-13275
 URL: https://issues.apache.org/jira/browse/SPARK-13275
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Stephanie Bodoff
Priority: Minor
 Attachments: webui.png

When I look at the timeline in the Spark Web UI I see the job starting and then 
executors being added. The blue lines and dots hitting the timeline show that 
the executors were added after the job started. But the way the Executor box is 
rendered, it looks as if the executors started before the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13126) History Server page always has horizontal scrollbar

2016-02-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-13126.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> History Server page always has horizontal scrollbar
> ---
>
> Key: SPARK-13126
> URL: https://issues.apache.org/jira/browse/SPARK-13126
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Zhuo Liu
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: page_width.png
>
>
> The new History Server page table is always wider than the page no matter how 
> much larger you make the window. Most likely an odd CSS error; it doesn't seem 
> to be a simple fix when manipulating the CSS using the Web Inspector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2016-02-10 Thread Oscar Boykin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141589#comment-15141589
 ] 

Oscar Boykin commented on SPARK-11416:
--

Related issue: https://issues.apache.org/jira/browse/STORM-1537

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13126) History Server page always has horizontal scrollbar

2016-02-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-13126:
--
Assignee: Zhuo Liu

> History Server page always has horizontal scrollbar
> ---
>
> Key: SPARK-13126
> URL: https://issues.apache.org/jira/browse/SPARK-13126
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Zhuo Liu
>Priority: Minor
> Attachments: page_width.png
>
>
> The new History Server page table is always wider than the page no matter how 
> much larger you make the window. Most likely an odd CSS error; it doesn't seem 
> to be a simple fix when manipulating the CSS using the Web Inspector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13274) Fix Aggregator Links on GroupedDataset Scala API

2016-02-10 Thread Raela Wang (JIRA)
Raela Wang created SPARK-13274:
--

 Summary: Fix Aggregator Links on GroupedDataset Scala API 
 Key: SPARK-13274
 URL: https://issues.apache.org/jira/browse/SPARK-13274
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Raela Wang
Priority: Trivial


Update Scala API docs for GroupedDataset. Links in flatMapGroups() and 
mapGroups() are pointing to the wrong Aggregator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13273) Improve test coverage of CatalystQl

2016-02-10 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-13273:
-

 Summary: Improve test coverage of CatalystQl
 Key: SPARK-13273
 URL: https://issues.apache.org/jira/browse/SPARK-13273
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell


The current CatalystQl tests are quite basic and are far from complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13272) Clean-up CatalystQl

2016-02-10 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-13272:
-

 Summary: Clean-up CatalystQl
 Key: SPARK-13272
 URL: https://issues.apache.org/jira/browse/SPARK-13272
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell


We still have some technical debt in CatalystQl:
* It should be placed in the parser package.
* Most of the methods are lacking proper documentation.
* Some code (regexes) could be moved into an Object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13270) Improve readability of whole stage codegen by skipping empty lines and outputting the pipeline plan

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13270:


Assignee: Apache Spark

> Improve readability of whole stage codegen by skipping empty lines and 
> outputting the pipeline plan
> ---
>
> Key: SPARK-13270
> URL: https://issues.apache.org/jira/browse/SPARK-13270
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>Assignee: Apache Spark
>
> It would be nice to comment the generated function with the pipeline it is 
> for, particularly for complex queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13270) Improve readability of whole stage codegen by skipping empty lines and outputting the pipeline plan

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141571#comment-15141571
 ] 

Apache Spark commented on SPARK-13270:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11155

> Improve readability of whole stage codegen by skipping empty lines and 
> outputting the pipeline plan
> ---
>
> Key: SPARK-13270
> URL: https://issues.apache.org/jira/browse/SPARK-13270
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>
> It would be nice to comment the generated function with the pipeline it is 
> for, particularly for complex queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13270) Improve readability of whole stage codegen by skipping empty lines and outputting the pipeline plan

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13270:


Assignee: (was: Apache Spark)

> Improve readability of whole stage codegen by skipping empty lines and 
> outputting the pipeline plan
> ---
>
> Key: SPARK-13270
> URL: https://issues.apache.org/jira/browse/SPARK-13270
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>
> It would be nice to comment the generated function with the pipeline it is 
> for, particularly for complex queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13163) Column width on new History Server DataTables not getting set correctly

2016-02-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-13163.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Column width on new History Server DataTables not getting set correctly
> ---
>
> Key: SPARK-13163
> URL: https://issues.apache.org/jira/browse/SPARK-13163
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: page_width_fixed.png, width_long_name.png
>
>
> The column width on the DataTable UI for the History Server is being set for 
> all entries in the table, not just the current page. This means if there is 
> even one app with a long name in your history, the table will look really odd, 
> as seen below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141552#comment-15141552
 ] 

Apache Spark commented on SPARK-13271:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11154

> Better error message if 'path' is not specified
> ---
>
> Key: SPARK-13271
> URL: https://issues.apache.org/jira/browse/SPARK-13271
> Project: Spark
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>
> As per discussion in 
> https://github.com/apache/spark/pull/11034#discussion_r52111238
> we should improve the error message:
> {code}
> scala> sqlContext.read.format("text").load()
> java.util.NoSuchElementException: key not found: path
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13271:


Assignee: (was: Apache Spark)

> Better error message if 'path' is not specified
> ---
>
> Key: SPARK-13271
> URL: https://issues.apache.org/jira/browse/SPARK-13271
> Project: Spark
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>
> As per discussion in 
> https://github.com/apache/spark/pull/11034#discussion_r52111238
> we should improve the error message:
> {code}
> scala> sqlContext.read.format("text").load()
> java.util.NoSuchElementException: key not found: path
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13271:


Assignee: Apache Spark

> Better error message if 'path' is not specified
> ---
>
> Key: SPARK-13271
> URL: https://issues.apache.org/jira/browse/SPARK-13271
> Project: Spark
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> As per discussion in 
> https://github.com/apache/spark/pull/11034#discussion_r52111238
> we should improve the error message:
> {code}
> scala> sqlContext.read.format("text").load()
> java.util.NoSuchElementException: key not found: path
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13271) Better error message if 'path' is not specified

2016-02-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13271:


 Summary: Better error message if 'path' is not specified
 Key: SPARK-13271
 URL: https://issues.apache.org/jira/browse/SPARK-13271
 Project: Spark
  Issue Type: Bug
Reporter: Shixiong Zhu


As per discussion in 
https://github.com/apache/spark/pull/11034#discussion_r52111238
we should improve the error message:

{code}
scala> sqlContext.read.format("text").load()
java.util.NoSuchElementException: key not found: path
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at 
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:159)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at 
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:159)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:129)
  ... 49 elided
{code}
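One possible shape of the improvement, sketched only as an illustration (names are hypothetical, not the actual patch): resolve 'path' with a descriptive failure instead of letting the bare NoSuchElementException above escape.

{code}
object PathOption {
  // Hedged sketch: look up 'path' and raise a readable error when it is missing.
  def resolvePath(options: Map[String, String]): String =
    options.getOrElse("path", throw new IllegalArgumentException(
      """'path' is not specified. Set it via option("path", ...) or pass it to load(path)."""))

  def main(args: Array[String]): Unit = {
    println(resolvePath(Map("path" -> "/tmp/data"))) // prints /tmp/data
    resolvePath(Map("format" -> "text"))             // throws IllegalArgumentException
  }
}
{code}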



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13054) Always post TaskEnd event for tasks in cancelled stages

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141520#comment-15141520
 ] 

Apache Spark commented on SPARK-13054:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/10951

> Always post TaskEnd event for tasks in cancelled stages
> ---
>
> Key: SPARK-13054
> URL: https://issues.apache.org/jira/browse/SPARK-13054
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // The success case is dealt with separately below.
> // TODO: Why post it only for failed tasks in cancelled stages? Clarify 
> semantics here.
> if (event.reason != Success) {
>   val attemptId = task.stageAttemptId
>   listenerBus.post(SparkListenerTaskEnd(
> stageId, attemptId, taskType, event.reason, event.taskInfo, 
> taskMetrics))
> }
> {code}
> Today we only post task end events for canceled stages if the task failed. 
> There is no reason why we shouldn't just post it for all the tasks, including 
> the ones that succeeded. If we do that we will be able to simplify another 
> branch in the DAGScheduler, which needs a lot of simplification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13270) Improve readability of whole stage codegen by skipping empty lines and outputting the pipeline plan

2016-02-10 Thread Nong Li (JIRA)
Nong Li created SPARK-13270:
---

 Summary: Improve readability of whole stage codegen by skipping 
empty lines and outputting the pipeline plan
 Key: SPARK-13270
 URL: https://issues.apache.org/jira/browse/SPARK-13270
 Project: Spark
  Issue Type: Bug
Reporter: Nong Li


It would be nice to comment the generated function with the pipeline it is for, 
particularly for complex queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13269) Expose more executor stats in stable status API

2016-02-10 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13269:
-

 Summary: Expose more executor stats in stable status API
 Key: SPARK-13269
 URL: https://issues.apache.org/jira/browse/SPARK-13269
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or


Currently the stable status API is quite limited; it exposes only a small 
subset of the things exposed by JobProgressListener. It is useful for very 
high-level querying but falls short when the developer wants to build an 
application on top of Spark with more integration.

In this issue I propose that we expose at least two things:
- Which executors are running tasks, and
- Which executors have cached how much data in memory and on disk

The goal is not to expose exactly these two things, but to expose something 
that would allow the developer to learn about them. These concepts are 
fundamental to Spark's design, so there's almost no chance that they will 
go away in the future.
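For context, a minimal sketch of what the stable API offers today (method names as in the Spark 1.x SparkStatusTracker, used from an assumed active SparkContext named sc); at the time of writing, nothing here answers either of the two questions above:

{code}
// Assumes an active SparkContext named sc (e.g. in spark-shell).
val tracker = sc.statusTracker
tracker.getActiveJobIds().foreach { jobId =>
  tracker.getJobInfo(jobId).foreach { job =>
    println(s"job $jobId -> stages ${job.stageIds().mkString(",")} (${job.status()})")
  }
}
// No per-executor view: which executors run tasks, or how much each has cached,
// is only reachable through the unstable listener / JobProgressListener internals.
{code}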



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13269) Expose more executor stats in stable status API

2016-02-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13269:
--
Issue Type: Improvement  (was: Bug)

> Expose more executor stats in stable status API
> ---
>
> Key: SPARK-13269
> URL: https://issues.apache.org/jira/browse/SPARK-13269
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Or
>
> Currently the stable status API is quite limited; it exposes only a small 
> subset of the things exposed by JobProgressListener. It is useful for very 
> high level querying but falls short when the developer wants to build an 
> application on top of Spark with more integration.
> In this issue I propose that we expose at least two things:
> - Which executors are running tasks, and
> - Which executors cached how much in memory and on disk
> The goal is not to expose exactly these two things, but to expose something 
> that would allow the developer to learn about them. These concepts are very 
> much fundamental in Spark's design so there's almost no chance that they will 
> go away in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13254) Fix planning of TakeOrderedAndProject operator

2016-02-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13254.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11145
[https://github.com/apache/spark/pull/11145]

> Fix planning of TakeOrderedAndProject operator
> --
>
> Key: SPARK-13254
> URL: https://issues.apache.org/jira/browse/SPARK-13254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit") 
> inadvertently broke the planning of the TakeOrderedAndProject operator: 
> because ReturnAnswer was the new root of the query plan, the 
> TakeOrderedAndProject rule was unable to match before BasicOperators. We 
> should fix this by moving all rules that match on ReturnAnswer to run at the 
> start of the physical planning process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2016-02-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-5095.
--
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Support launching multiple mesos executors in coarse grained mesos mode
> ---
>
> Key: SPARK-5095
> URL: https://issues.apache.org/jira/browse/SPARK-5095
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 2.0.0
>
>
> Currently, in coarse-grained Mesos mode, it's expected that we launch only 
> one Mesos executor, which starts one JVM process to run multiple Spark 
> executors.
> However, this becomes a problem when the launched JVM process is larger than 
> an ideal size (30 GB is the value recommended by Databricks), which causes 
> the GC problems reported on the mailing list.
> We should support launching multiple executors when resources large enough 
> for Spark to use are available and still under the configured limit.
> This is also applicable when users want to specify the number of executors 
> to be launched on each node.
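
As a hedged illustration of how a user might size executors once multiple 
executors per node are supported (the Mesos master URL is hypothetical; the 
keys are standard Spark settings, and how the coarse-grained backend interprets 
them is exactly what this issue changes):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("mesos-multi-executor-demo")
  .setMaster("mesos://zk://mesos-master:2181/mesos") // hypothetical Mesos master
  .set("spark.executor.memory", "30g")               // keep each JVM near the recommended size
  .set("spark.executor.cores", "8")                  // cores per executor, not per node
  .set("spark.cores.max", "64")                      // overall cap across the cluster

val sc = new SparkContext(conf)
{code}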



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13174) Add API and options for csv data sources

2016-02-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141438#comment-15141438
 ] 

Davies Liu commented on SPARK-13174:


[~GayathriMurali] Yes, there is a way, but it's not as good as the other 
built-in data sources (like parquet, json, and jdbc).
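
For context, the current workaround goes through the external spark-csv package 
by name, while the proposal is a first-class csv() method mirroring json() and 
jdbc(). A sketch (the path is hypothetical, and the csv() call is the proposed 
API, not something available yet):

{code}
// Today: load CSV via the external Databricks spark-csv data source.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("/path/to/data.csv")

// Proposed shape (not yet implemented):
// val df = sqlContext.read.option("header", "true").csv("/path/to/data.csv")
{code}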

> Add API and options for csv data sources
> 
>
> Key: SPARK-13174
> URL: https://issues.apache.org/jira/browse/SPARK-13174
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Reporter: Davies Liu
>
> We should have an API to load a CSV data source (with some options as 
> arguments), similar to json() and jdbc().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12705:


Assignee: Apache Spark  (was: Davies Liu)

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> The following query can't be resolved:
> {code}
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12705:


Assignee: Davies Liu  (was: Apache Spark)

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The following query can't be resolved:
> {code}
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141387#comment-15141387
 ] 

Apache Spark commented on SPARK-12705:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11153

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The following query can't be resolved:
> {code}
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-02-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-12705:

  Assignee: Davies Liu  (was: Xiao Li)

Q98 still can't be analyzed; I will send a PR to fix that.

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The following query can't be resolved:
> {code}
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13061) Error in spark rest api application info for job names contains spaces

2016-02-10 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141340#comment-15141340
 ] 

Devaraj K commented on SPARK-13061:
---


You have mentioned the id as 'Spark shell' in the issue description; I don't 
think that is what the API actually returns.

{code:xml}
http://spark.mysite.com:20888/proxy/application_1447676402999_1254/api/v1/applications/
 returns:
[ {
"id" : "Spark shell",
"name" : "Spark shell",
{code}

If we request an HTTP server with a URL that contains spaces, using any 
browser or other client, then the browser/client encodes the URL (replacing 
spaces with %20) before sending the request to the HTTP server. This is what 
happens when you pass the id as "Spark shell".

{code:xml}/applications/[app-id]/jobs/[job-id]  Details for the given job{code}

I think you need to pass the job-id, not the name, if you want details for a 
specific job.
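
As a concrete sketch of that suggestion (host and ids are taken from the issue 
description and are otherwise hypothetical): list the applications first to 
discover the real [app-id], then use that id together with a numeric [job-id] 
in the path, rather than the application name:

{code}
import scala.io.Source

val base = "http://spark.mysite.com:20888/proxy/application_1447676402999_1254/api/v1"

// 1) List applications and read the "id" field from the returned JSON.
println(Source.fromURL(s"$base/applications").mkString)

// 2) Query a specific job by [app-id] and numeric [job-id], not by display name.
val appId = "application_1447676402999_1254" // assumption: the id reported in step 1
println(Source.fromURL(s"$base/applications/$appId/jobs/0").mkString)
{code}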

> Error in spark rest api application info for job names contains spaces
> --
>
> Key: SPARK-13061
> URL: https://issues.apache.org/jira/browse/SPARK-13061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Avihoo Mamka
>Priority: Trivial
>  Labels: rest_api, spark
>
> When accessing spark rest api with application id to get job specific id 
> status, a job with name containing whitespaces are being encoded to '%20' and 
> therefore the rest api returns `no such app`.
> For example:
> http://spark.mysite.com:20888/proxy/application_1447676402999_1254/api/v1/applications/
>  returns:
> [ {
>   "id" : "Spark shell",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-01-28T09:20:58.526GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "sparkUser" : "",
> "completed" : false
>   } ]
> } ]
> and then when accessing:
> http://spark.mysite.com:20888/proxy/application_1447676402999_1254/api/v1/applications/Spark
>  shell/
> the result returned is:
> unknown app: Spark%20shell



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13253) Error aliasing array columns.

2016-02-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141252#comment-15141252
 ] 

kevin yu commented on SPARK-13253:
--

I can recreate the problem; I am looking into it now.

> Error aliasing array columns.
> -
>
> Key: SPARK-13253
> URL: https://issues.apache.org/jira/browse/SPARK-13253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Rakesh Chalasani
>
> Getting an "UnsupportedOperationException" when trying to alias an
> array column.
> The issue seems to be with "toString" on Column: the "CreateArray" 
> expression's dataType checks the nullability of its children, while aliasing 
> creates a PrettyAttribute that does not implement nullability.
> Code to reproduce the error:
> {code}
> import org.apache.spark.sql.SQLContext 
> val sqlContext = new SQLContext(sparkContext) 
> import sqlContext.implicits._ 
> import org.apache.spark.sql.functions 
> case class Test(a:Int, b:Int) 
> val data = sparkContext.parallelize(Array.range(0, 10).map(x => Test(x, 
> x+1))) 
> val df = data.toDF() 
> val arrayCol = functions.array(df("a"), df("b")).as("arrayCol")
> arrayCol.toString()
> {code}
> Error message:
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.PrettyAttribute.nullable(namedExpressions.scala:289)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$dataType$3.apply(complexTypeCreator.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$dataType$3.apply(complexTypeCreator.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
>   at 
> scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:47)
>   at scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
>   at scala.collection.AbstractSeq.prefixLength(Seq.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
>   at scala.collection.mutable.ArrayBuffer.exists(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray.dataType(complexTypeCreator.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.NamedExpression$class.typeSuffix(namedExpressions.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.typeSuffix(namedExpressions.scala:120)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:155)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:207)
>   at org.apache.spark.sql.Column.toString(Column.scala:138)
>   at java.lang.String.valueOf(String.java:2994)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:331)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:20)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2016-02-10 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141243#comment-15141243
 ] 

Evan Chan commented on SPARK-12449:
---

[~rxin] I agree with [~stephank85] and others that this would be a huge help. 
At the very least, being able to push down expressions would help a lot. Many 
databases are doing custom work to get the pushdowns they need; I was thinking 
of doing something very similar and was going to propose something just like 
this.

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 
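
To make the shape of the proposal concrete, here is a purely hypothetical 
sketch of what such an interface could look like; the names are invented for 
illustration, and the attached design document is authoritative:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait CatalystSource {
  /** Whether the source can evaluate this (sub)plan itself, e.g. filters plus aggregates. */
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  /** Execute the supported plan inside the source and return the resulting rows. */
  def logicalPlanScan(plan: LogicalPlan): RDD[Row]
}
{code}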



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13267) Document ?params for the v1 REST API

2016-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141104#comment-15141104
 ] 

Apache Spark commented on SPARK-13267:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/11152

> Document ?params for the v1 REST API
> 
>
> Key: SPARK-13267
> URL: https://issues.apache.org/jira/browse/SPARK-13267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> There are various ?param options in the v1 REST API which aren't mentioned 
> anywhere except in the HistoryServerSuite. They should be documented in 
> monitoring.md.
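
For example, the applications listing accepts status and date filters that are 
currently only exercised in HistoryServerSuite; a hedged illustration (the 
history server host is hypothetical):

{code}
import scala.io.Source

val api = "http://history-server:18080/api/v1"

// The kind of ?params this issue wants documented in monitoring.md.
println(Source.fromURL(s"$api/applications?status=completed&minDate=2015-11-20").mkString)
{code}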



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13267) Document ?params for the v1 REST API

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13267:


Assignee: (was: Apache Spark)

> Document ?params for the v1 REST API
> 
>
> Key: SPARK-13267
> URL: https://issues.apache.org/jira/browse/SPARK-13267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> There are various ?param options in the v1 REST API which aren't mentioned 
> anywhere except in the HistoryServerSuite. They should be documented in 
> monitoring.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13267) Document ?params for the v1 REST API

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13267:


Assignee: Apache Spark

> Document ?params for the v1 REST API
> 
>
> Key: SPARK-13267
> URL: https://issues.apache.org/jira/browse/SPARK-13267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> There are various ?param options in the v1 REST API which aren't mentioned 
> anywhere except in the HistoryServerSuite. They should be documented in 
> monitoring.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13268) SQL Timestamp stored as GMT but toString returns GMT-08:00

2016-02-10 Thread Ilya Ganelin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ganelin updated SPARK-13268:
-
Description: 
There is an issue with how timestamps are displayed/converted to Strings in 
Spark SQL. The documentation states that the timestamp should be created in the 
GMT time zone, however, if we do so, we see that the output actually contains a 
-8 hour offset:

{code}
new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
res144: java.sql.Timestamp = 2014-12-31 16:00:00.0

new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
{code}

This result is confusing, unintuitive, and introduces issues when converting 
from DataFrames containing timestamps to RDDs which are then saved as text. 
This has the effect of essentially shifting all dates in a dataset by 1 day. 

The suggested fix for this is to update the timestamp toString representation 
to either a) Include timezone or b) Correctly display in GMT.

This change may well introduce substantial and insidious bugs so I'm not sure 
how best to resolve this.


  was:
There is an issue with how timestamps are displayed/converted to Strings in 
Spark SQL. The documentation states that the timestamp should be created in the 
GMT time zone, however, if we do so, we see that the output actually contains a 
-8 hour offset:

{code}
new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
res144: java.sql.Timestamp = 2014-12-31 16:00:00.0

new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
{code}

This result is confusing, unintuitive, and introduces issues when converting 
from DataFrames containing timestamps to RDDs which are then saved as text. 
This has the effect of essentially shifting all dates in a dataset by 1 day. 



> SQL Timestamp stored as GMT but toString returns GMT-08:00
> --
>
> Key: SPARK-13268
> URL: https://issues.apache.org/jira/browse/SPARK-13268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ilya Ganelin
>
> There is an issue with how timestamps are displayed/converted to Strings in 
> Spark SQL. The documentation states that the timestamp should be created in 
> the GMT time zone, however, if we do so, we see that the output actually 
> contains a -8 hour offset:
> {code}
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
> res144: java.sql.Timestamp = 2014-12-31 16:00:00.0
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
> res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
> {code}
> This result is confusing, unintuitive, and introduces issues when converting 
> from DataFrames containing timestamps to RDDs which are then saved as text. 
> This has the effect of essentially shifting all dates in a dataset by 1 day. 
> The suggested fix for this is to update the timestamp toString representation 
> to either a) Include timezone or b) Correctly display in GMT.
> This change may well introduce substantial and insidious bugs so I'm not sure 
> how best to resolve this.
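
For comparison, formatting the same instant with an explicit GMT zone (option b) 
shows the stored value unambiguously; a small self-contained illustration:

{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

// 2015-01-01T00:00:00Z as epoch milliseconds.
val ts = new Timestamp(1420070400000L)

// Timestamp.toString uses the JVM's default time zone (hence the -8 hour shift);
// formatting with an explicit GMT zone shows the stored instant directly.
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S z")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(ts)) // 2015-01-01 00:00:00.0 GMT
{code}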



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13268) SQL Timestamp stored as GMT but toString returns GMT-08:00

2016-02-10 Thread Ilya Ganelin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ganelin updated SPARK-13268:
-
Description: 
There is an issue with how timestamps are displayed/converted to Strings in 
Spark SQL. The documentation states that the timestamp should be created in the 
GMT time zone, however, if we do so, we see that the output actually contains a 
-8 hour offset:

{code}
new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
res144: java.sql.Timestamp = 2014-12-31 16:00:00.0

new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
{code}

This result is confusing, unintuitive, and introduces issues when converting 
from DataFrames containing timestamps to RDDs which are then saved as text. 
This has the effect of essentially shifting all dates in a dataset by 1 day. 


  was:
There is an issue with how timestamps are displayed/converted to Strings in 
Spark SQL. The documentation states that the timestamp should be created in the 
GMT time zone, however, if we do so, we see that the output actually contains a 
-8 hour offset:

{{ 
new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
res144: java.sql.Timestamp = 2014-12-31 16:00:00.0

new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
}}

This result is confusing, unintuitive, and introduces issues when converting 
from DataFrames containing timestamps to RDDs which are then saved as text. 
This has the effect of essentially shifting all dates in a dataset by 1 day. 



> SQL Timestamp stored as GMT but toString returns GMT-08:00
> --
>
> Key: SPARK-13268
> URL: https://issues.apache.org/jira/browse/SPARK-13268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ilya Ganelin
>
> There is an issue with how timestamps are displayed/converted to Strings in 
> Spark SQL. The documentation states that the timestamp should be created in 
> the GMT time zone, however, if we do so, we see that the output actually 
> contains a -8 hour offset:
> {code}
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
> res144: java.sql.Timestamp = 2014-12-31 16:00:00.0
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
> res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
> {code}
> This result is confusing, unintuitive, and introduces issues when converting 
> from DataFrames containing timestamps to RDDs which are then saved as text. 
> This has the effect of essentially shifting all dates in a dataset by 1 day. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13268) SQL Timestamp stored as GMT but toString returns GMT-08:00

2016-02-10 Thread Ilya Ganelin (JIRA)
Ilya Ganelin created SPARK-13268:


 Summary: SQL Timestamp stored as GMT but toString returns GMT-08:00
 Key: SPARK-13268
 URL: https://issues.apache.org/jira/browse/SPARK-13268
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Ilya Ganelin


There is an issue with how timestamps are displayed/converted to Strings in 
Spark SQL. The documentation states that the timestamp should be created in the 
GMT time zone, however, if we do so, we see that the output actually contains a 
-8 hour offset:

{{ 
new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
res144: java.sql.Timestamp = 2014-12-31 16:00:00.0

new 
Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
}}

This result is confusing, unintuitive, and introduces issues when converting 
from DataFrames containing timestamps to RDDs which are then saved as text. 
This has the effect of essentially shifting all dates in a dataset by 1 day. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13267) Document ?params for the v1 REST API

2016-02-10 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-13267:
--

 Summary: Document ?params for the v1 REST API
 Key: SPARK-13267
 URL: https://issues.apache.org/jira/browse/SPARK-13267
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.6.0
Reporter: Steve Loughran
Priority: Minor


There are various ?param options in the v1 REST API which aren't mentioned 
anywhere except in the HistoryServerSuite. They should be documented in 
monitoring.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2016-02-10 Thread Prosper Burq (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140906#comment-15140906
 ] 

Prosper Burq commented on SPARK-11085:
--

Hi,

Is this problem still unresolved? I tried several options but could not find a 
way to allow spark-submit to connect through the proxy. I tried passing 
environment variables in different ways, but none of them worked.


> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
>  
>  proxyport="8080" 
> nonproxyhosts="nonproxy.host"/> 
>  
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null

2016-02-10 Thread mathieu longtin (JIRA)
mathieu longtin created SPARK-13266:
---

 Summary: Python DataFrameReader converts None to "None" instead of 
null
 Key: SPARK-13266
 URL: https://issues.apache.org/jira/browse/SPARK-13266
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.0
 Environment: Linux standalone but probably applies to all
Reporter: mathieu longtin


If you do something like this:
{code:none}
tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
tsv_loader.options(quote=None, escape=None)
{code}

The loader sees the string "None" as the _quote_ and _escape_ options. The 
loader should get a _null_.

An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top to 
correct the _to_str_ function. Here's the patch:

{code:none}
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index a3d7eca..ba18d13 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]

 def to_str(value):
 """
-A wrapper over str(), but convert bool values to lower case string
+A wrapper over str(), but convert bool values to lower case string, and 
keep None
 """
 if isinstance(value, bool):
 return str(value).lower()
+elif value is None:
+return value
 else:
 return str(value)

{code}

This has been tested and works great.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS

2016-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13265:


Assignee: Apache Spark

> Refactoring of basic ML import/export for other file system besides HDFS
> 
>
> Key: SPARK-13265
> URL: https://issues.apache.org/jira/browse/SPARK-13265
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>
> We can't save a model to a file system other than HDFS, for example Amazon 
> S3, because the file system is fixed in Spark 1.6.
> https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78
> When I tried to export a KMeans model to Amazon S3, I got the following error.
> {noformat}
> scala> val kmeans = new KMeans().setK(2)
> scala> val model = kmeans.fit(train)
> scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/")
> java.lang.IllegalArgumentException: Wrong FS: 
> s3n://test-bucket/tmp/test-kmeans, expected: 
> hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c
> om:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC.(:51)
> at $iwC.(:53)
> at (:55)
> at .(:59)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.ap
