[jira] [Updated] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4598: -- Affects Version/s: 1.2.0 Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: meiyoula Priority: Critical In the HistoryServer stage page, clicking the task href in the Description column triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228677#comment-14228677 ] Josh Rosen commented on SPARK-4598: --- I was able to reproduce this issue using the SparkPi example. I captured a heap dump in YourKit, and it looks like the raw, uncompressed HTML of the stage page is over 75 megabytes, while the Scala XML tree corresponding to the page is hundreds of megabytes (~200). The actual HTML itself should be highly compressible, since it contains a lot of redundancy. In the longer term, we could also explore approaches that perform more of the rendering / formatting in the browser using Javascript; this would allow us to send the task table data as JSON or CSV, which would contain much less redundancy, and we could also avoid the overheads of the XML library. As a shorter-term hack, though, I wonder whether there's some trick to reduce the overall memory usage of the intermediate scala.xml data structures, since it seems odd that we end up materializing such a large object graph when it seems like large portions of it could be lazily streamed. Maybe there's some simple trick where sprinkling in a few {{.iterator}} calls would improve things.
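To illustrate the streaming idea above: a minimal sketch (not Spark code; `TaskRow`, `renderRow`, and the markup are illustrative assumptions) of the difference between materializing a whole page and yielding it lazily, so a servlet could stream chunks to the response writer instead of holding the full page in memory:

```scala
// Hypothetical task row; the real UI renders many more columns.
case class TaskRow(id: Int, status: String, durationMs: Long)

def renderRow(t: TaskRow): String =
  s"<tr><td>${t.id}</td><td>${t.status}</td><td>${t.durationMs}</td></tr>"

// Eager: the entire page is built as one String (or scala.xml tree) in memory.
def renderEager(tasks: Seq[TaskRow]): String =
  tasks.map(renderRow).mkString("<table>", "", "</table>")

// Lazy: an Iterator of chunks; at most one row's HTML is resident at a time.
def renderStreaming(tasks: Seq[TaskRow]): Iterator[String] =
  Iterator("<table>") ++ tasks.iterator.map(renderRow) ++ Iterator("</table>")
```

Both produce the same bytes; the streaming version just never holds them all at once, which is the property the {{.iterator}} trick would be aiming for.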
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228684#comment-14228684 ] Josh Rosen commented on SPARK-4598: --- Actually, it might be pretty hard to trim down the memory usage via scala.xml tricks. Adding some functionality to return the stage table information as CSV data might be a cleaner way to handle this. This doesn't necessarily imply using AJAX requests to load the data from the backend; we could just dump the CSV data into a script tag and load it via Javascript. We might be able to hide all of this complexity behind the StageTableBase class, so we could confine this change to a small section of the code.
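A minimal sketch of the "CSV in a script tag" idea (the helper names, the `text/csv` tag layout, and the element id are illustrative assumptions, not from the Spark codebase): the server emits compact CSV once, and client-side Javascript would parse it and build the table DOM:

```scala
// Quote a CSV field only when needed (comma, quote, or newline present),
// doubling embedded quotes per the usual CSV convention.
def escapeCsv(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

// Flatten rows of fields into one CSV payload.
def toCsv(rows: Seq[Seq[String]]): String =
  rows.map(_.map(escapeCsv).mkString(",")).mkString("\n")

// Embed the payload in a non-executing script tag; a browser ignores
// type="text/csv", but Javascript can read it back by element id.
def embedInScriptTag(csv: String): String =
  "<script type=\"text/csv\" id=\"task-data\">\n" + csv + "\n</script>"
```

The payload carries each value once, with none of the per-cell tag overhead of the rendered HTML, which is where the size win would come from.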
[jira] [Created] (SPARK-4652) Add docs about spark-git-repo option
Kai Sasaki created SPARK-4652: - Summary: Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Minor It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed for it.
[jira] [Commented] (SPARK-4652) Add docs about spark-git-repo option
[ https://issues.apache.org/jira/browse/SPARK-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228789#comment-14228789 ] Apache Spark commented on SPARK-4652: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3513
[jira] [Commented] (SPARK-4082) Show Waiting/Queued Stages in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228820#comment-14228820 ] Patrick Wendell commented on SPARK-4082: IMO this is sufficiently addressed by the jobs page. Or at least, now that we have the jobs page, I'd be interested in seeing whether people still feel a big need for pending stages in the stage page. Show Waiting/Queued Stages in Spark UI -- Key: SPARK-4082 URL: https://issues.apache.org/jira/browse/SPARK-4082 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Pat McDonough In the Stages UI page, it would be helpful to show the user any stages the DAGScheduler has planned but that are not yet active. Currently, this info is not shown to the user in any way. /CC [~pwendell]
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228839#comment-14228839 ] Aaron Davidson commented on SPARK-4644: --- [~zsxwing] I believe that this problem is related more fundamentally to the problem that Spark currently requires all values for the same key to remain in memory. Your solution aims to fix this for the specific case of joins, but I wonder whether, if we generalized it, we could solve this for things like groupBy as well. I don't have a fully fleshed-out idea yet, but I was considering a model where there are two types of shuffles: aggregation-based and rearrangement-based. Aggregation-based shuffles use partial aggregation and combiners to form and merge (K, C) pairs. Rearrangement-based shuffles, however, do not expect a decrease in the total amount of data, so that model does not make sense for them. Instead, we could provide an interface similar to ExternalAppendOnlyMap but which returns Iterator[(K, Iterable[V])] pairs, with some extra semantics attached to the Iterable[V]s (such as a .chunkedIterator() method which enables block nested loops join). In this model, join could be implemented by mapping the left side's key to (K, 1) and the right side's to (K, 2), with logic that reads from two adjacent value-iterables simultaneously -- e.g., {code} val ((k, 1), left: Iterable[V]) = map.next() val ((k, 2), right: Iterable[V]) = map.next() // perform merge using the left and right iterators {code} Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books which are liked by most of the users. Running ALS on such skewed data will raise an OutOfMemory error if some book has too many users to fit into memory. To solve this, we propose a skewed join implementation.
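A minimal in-memory sketch of the tagging scheme Aaron describes above (this is an assumption-laden illustration, not Spark code: a real implementation would spill to disk like ExternalAppendOnlyMap and use chunked iterators rather than materialized groups). Keys become (k, 1) on the left side and (k, 2) on the right, so after sorting, each key's left group is immediately followed by its right group and the two can be merged with a nested loop:

```scala
// Join two sides by tagging keys so adjacent sorted groups can be merged.
def tagAndJoin(left: Seq[(String, Int)],
               right: Seq[(String, Int)]): Seq[(String, (Int, Int))] = {
  // Tag: left values get key (k, 1), right values get key (k, 2).
  val tagged =
    left.map  { case (k, v) => ((k, 1), v) } ++
    right.map { case (k, v) => ((k, 2), v) }
  // Group values per (key, tag) and sort, like a sorted external map would.
  val groups = tagged.groupBy(_._1).toSeq.sortBy(_._1)
    .map { case (kt, kvs) => (kt, kvs.map(_._2)) }
  // Walk adjacent group pairs; a (k, 1) group followed by the matching
  // (k, 2) group yields the cross product of their values.
  groups.sliding(2).collect {
    case Seq(((k1, 1), lvs), ((k2, 2), rvs)) if k1 == k2 =>
      for (l <- lvs; r <- rvs) yield (k1, (l, r))
  }.flatten.toSeq
}
```

The point of the interface sketch in the comment is that only two groups need to be resident at once, and with chunked iterators not even a whole group would be.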
[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Description: Support Coalesce function in Spark SQL. Support type widening in Coalesce function. And replace Coalesce UDF in Spark Hive with local Coalesce function since it is memory efficient and faster. was: Support Coalesce function in Spark SQL
[jira] [Updated] (SPARK-4648) Support COALESCE function in Spark SQL and HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Summary: Support COALESCE function in Spark SQL and HiveQL (was: Support Coalesce in Spark SQL.)
[jira] [Created] (SPARK-4653) DAGScheduler refactoring and cleanup
Josh Rosen created SPARK-4653: - Summary: DAGScheduler refactoring and cleanup Key: SPARK-4653 URL: https://issues.apache.org/jira/browse/SPARK-4653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen This is an umbrella JIRA for DAGScheduler refactoring and cleanup.
[jira] [Created] (SPARK-4654) Clean up DAGScheduler's getMissingParentStages() and stageDependsOn() methods
Josh Rosen created SPARK-4654: - Summary: Clean up DAGScheduler's getMissingParentStages() and stageDependsOn() methods Key: SPARK-4654 URL: https://issues.apache.org/jira/browse/SPARK-4654 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen DAGScheduler has {{getMissingParentStages()}} and {{stageDependsOn()}} methods, which are suspiciously similar to {{getParentStages()}}. All of these methods perform traversal of the RDD / Stage graph to inspect parent stages. We can remove both of these methods, though: the set of parent stages is known when a {{Stage}} instance is constructed and is already stored in {{Stage.parents}}, so we can just check for missing stages by looking for unavailable stages in {{Stage.parents}}. Similarly, we can determine whether one stage depends on another by searching {{Stage.parents}} rather than performing the entire graph traversal from scratch.
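The proposed simplification can be sketched as follows ({{Stage}} here is an illustrative stand-in, not the real DAGScheduler class): with parents stored on the stage itself, finding missing parents becomes a lookup over {{Stage.parents}}, and dependency checks traverse the stage graph rather than re-walking the RDD graph from scratch:

```scala
// Hypothetical stand-in for the scheduler's Stage, with parents precomputed
// at construction time, as the issue proposes.
case class Stage(id: Int, parents: List[Stage], isAvailable: Boolean)

// "Missing" parents are simply the unavailable ones among Stage.parents.
def getMissingParentStages(stage: Stage): List[Stage] =
  stage.parents.filterNot(_.isAvailable)

// Dependency check: search the parent chain transitively.
def stageDependsOn(stage: Stage, target: Stage): Boolean =
  stage.parents.exists(p => p == target || stageDependsOn(p, target))
```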
[jira] [Created] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
Josh Rosen created SPARK-4655: - Summary: Split Stage into ShuffleMapStage and ResultStage subclasses Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code.
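The shape of the proposed split, as an illustrative sketch (class bodies and field names here are assumptions, not the actual Spark fields): stage-kind-specific state moves out of the shared base class into the subclass that actually uses it:

```scala
// Hypothetical base class: only what every stage kind needs.
sealed abstract class Stage(val id: Int, val parents: List[Stage])

// Shuffle map stages would own shuffle-specific state.
class ShuffleMapStage(id: Int, parents: List[Stage],
                      val numOutputPartitions: Int) extends Stage(id, parents)

// Result stages would own job-result-specific state.
class ResultStage(id: Int, parents: List[Stage]) extends Stage(id, parents)
```

Code that handles both kinds can then pattern match on the subclass instead of checking nullable fields on one monolithic class.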
[jira] [Assigned] (SPARK-4653) DAGScheduler refactoring and cleanup
[ https://issues.apache.org/jira/browse/SPARK-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-4653: - Assignee: Josh Rosen
[jira] [Commented] (SPARK-4654) Clean up DAGScheduler's getMissingParentStages() and stageDependsOn() methods
[ https://issues.apache.org/jira/browse/SPARK-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228919#comment-14228919 ] Apache Spark commented on SPARK-4654: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3515
[jira] [Resolved] (SPARK-4622) Add the some error infomation if using spark-sql in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4622. --- Resolution: Duplicate Add the some error infomation if using spark-sql in yarn-cluster mode - Key: SPARK-4622 URL: https://issues.apache.org/jira/browse/SPARK-4622 Project: Spark Issue Type: Bug Components: Deploy Reporter: carlmartin If spark-sql is used in yarn-cluster mode, print an error message, just as the spark shell does in yarn-cluster mode.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228937#comment-14228937 ] Josh Rosen commented on SPARK-4498: --- Hi [~airhorns], I finally got a chance to look into this and, based on reading the code, I have a theory about what might be happening. If you look at the [current Master.scala file|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L668], you'll notice that there are only two situations where the standalone Master removes applications: - The master receives a DisassociatedEvent due to the application actor shutting down and calls {{finishApplication}}. - An executor exited with a non-zero exit status and the maximum number of executor failures has been exceeded. Now, imagine that for some reason the standalone Master does not receive a DisassociatedEvent. When executors eventually start to die, the standalone master will discover this via ExecutorStateChanged. If it hasn't hit the maximum number of executor failures, [it will attempt to re-schedule the application|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L325] and obtain new resources. 
If a new executor is granted, this will [cause the failed executor count to reset to zero|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L313], leading to a sort of livelock: executors die because they can't contact the application, but they keep being launched because executors keep entering the ExecutorState.RUNNING state ([it looks like|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L148] executors transition to this state when they launch, not once they've registered with the driver). It looks like the line {code} if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() } {code} was introduced in SPARK-2425, after the earliest commit that you mentioned, so this appears to be a regression in 1.2.0. I don't think that we should revert SPARK-2425, since that fixes another fairly important bug. Instead, I'd like to figure out how an application could fail without a DisassociatedEvent causing it to be removed. Could this be due to our use of non-standard Akka timeout / failure detector settings? I would think that we'd still get a DisassociatedEvent when a network connection was closed. Maybe we could switch to relying on our own explicit heartbeats for failure detection, like we do elsewhere in Spark. [~markhamstra], do you have any ideas here? 
Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3 java version 1.7.0_71 Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Critical Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, when they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly.
[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4498: -- Priority: Blocker (was: Critical) Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3 java version 1.7.0_71 Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Blocker Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, when they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly. 
The blip is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers. Deploy mode: - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised Other factoids: - In most places, we call sc.stop() explicitly before shutting down our driver process - Here's the sum total of Spark configuration options we don't set to the default: {code} spark.cores.max: 30 spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs spark.eventLog.enabled: true spark.executor.memory: 7g spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/ spark.io.compression.codec: lzf spark.ui.killEnabled: true {code} - The exception the executors die with is this: {code} 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started 14/11/19 19:42:37 INFO Remoting: Starting remoting 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682. 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421) {code}
[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4498: -- Affects Version/s: 1.1.1
[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4498: -- Target Version/s: 1.2.0, 1.1.2 Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3 java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Blocker Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports applications as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this blip applications start, get resources, finish, and are marked as finished properly. 
The blip is mostly conjecture on my part, I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers. Deploy mode: - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised Other factoids: - In most places, we call sc.stop() explicitly before shutting down our driver process - Here's the sum total of spark configuration options we don't set to the default: {code} spark.cores.max: 30 spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs spark.eventLog.enabled: true spark.executor.memory: 7g spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/ spark.io.compression.codec: lzf spark.ui.killEnabled: true {code} - The exception the executors die with is this: {code} 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started 14/11/19 19:42:37 INFO Remoting: Starting remoting 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682. 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at
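The failure loop in the description shows up in this executor log: the executor starts, tries to reach a driver that no longer exists, and dies after a 30-second timeout. A hedged sketch of the executor-side step that produces the "Futures timed out after [30 seconds]" error; the message and method names here are illustrative, not Spark's exact internals:

```scala
// Hedged sketch: the freshly launched executor asks the (already dead)
// driver for its Spark properties and blocks on the reply. When the driver
// is gone, the Await times out, the executor process dies, and the master
// relaunches it on another worker, repeating the cycle.
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.actor.ActorSelection
import akka.pattern.ask
import akka.util.Timeout

def fetchDriverProperties(driver: ActorSelection): Seq[(String, String)] = {
  implicit val timeout = Timeout(30.seconds)
  val reply = driver ? "RetrieveSparkProps" // illustrative message name
  // Throws java.util.concurrent.TimeoutException when the driver never answers,
  // which is the exception the executor log above shows before it exits.
  Await.result(reply, timeout.duration).asInstanceOf[Seq[(String, String)]]
}
```

Because the master only sees "executor EXITED", not "driver unreachable", it keeps relaunching executors instead of failing the application.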
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228943#comment-14228943 ] Josh Rosen commented on SPARK-4498: --- Adding 1.1.1 as an affected version as well, since SPARK-2425 was backported to that release.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228947#comment-14228947 ] Mark Hamstra commented on SPARK-4498: - On a quick look-through, your analysis looks likely to be correct, [~joshrosen]. Making sure that failed applications are always accompanied by a DisassociatedEvent would be a good thing. The belt-and-suspenders fix would be to also change the executor state-change semantics so that either RUNNING means not just that the executor process is running, but also that it has successfully connected to the application, or else introduce an additional executor state (perhaps REGISTERED) along with state transitions and finer-grained state logic controlling executor restart and application removal.
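The "additional executor state" idea could look roughly like this; a hypothetical enum, not Spark's actual ExecutorState:

```scala
// Sketch: RUNNING would be reserved for executors that have actually
// connected to the application, so the master can tell "process launched"
// apart from "registered with the app".
object ExecutorState extends Enumeration {
  val LAUNCHING, REGISTERED, RUNNING, KILLED, FAILED, LOST, EXITED = Value

  // An executor that exits without ever reaching REGISTERED suggests the
  // application itself is gone; the master could count these toward
  // removing the app instead of relaunching executors indefinitely.
  def exitedBeforeRegistering(lastState: Value): Boolean = lastState == LAUNCHING
}
```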
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228952#comment-14228952 ] Josh Rosen commented on SPARK-4498: --- In addition to exploring the missing DisassociatedEvent theory, it might also be worthwhile to brainstorm whether problems at other steps in the cleanup process could cause an application to fail to be removed. I'm not sure that a single missing DisassociatedEvent could explain the blip behavior observed here, where an entire group of applications fail to be marked as completed / failed. In the DisassociatedEvent handler, we index into {{addressToApp}} to determine which app corresponded to the DisassociatedEvent:
{code}
case DisassociatedEvent(_, address, _) => {
  // The disconnected client could've been either a worker or an app; remove whichever it was
  logInfo(s"$address got disassociated, removing it.")
  addressToWorker.get(address).foreach(removeWorker)
  addressToApp.get(address).foreach(finishApplication)
  if (state == RecoveryState.RECOVERING && canCompleteRecovery) { completeRecovery() }
}
{code}
If the {{addressToApp}} entry was empty / wrong, then we wouldn't properly clean up the app. However, I don't think that there should be any problems here because each application actor system should have its own distinct address and Akka's {{Address}} class properly implements hashCode / equals. Even if drivers run on the same host, their actor systems should have different port numbers. Continuing along:
{code}
def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
  if (apps.contains(app)) {
    logInfo("Removing app " + app.id)
{code}
Is there any way that the {{apps}} HashSet could fail to contain {{app}}? I don't think so: {{ApplicationInfo}} doesn't override equals/hashCode, but I don't think that's a problem since we only create one ApplicationInfo per app, so the default object identity comparison should be fine. 
We should probably log an error if we call {{removeApplication}} on an application that has already been removed, though. (Also, why do we need the {{apps}} HashSet when we could just use {{idToApp.values}}?)
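The suggested defensive logging can be sketched on a simplified stand-in for the master's bookkeeping; this is not the real Master.scala, and it is keyed on idToApp alone, per the parenthetical question about dropping the separate apps HashSet:

```scala
import scala.collection.mutable

// Sketch: a removal attempt for an unknown or already-removed app logs an
// error instead of silently doing nothing, so double removals and missed
// registrations become visible in the master log.
class MasterBookkeeping {
  private val idToApp = mutable.HashMap[String, AnyRef]()

  def registerApp(id: String, info: AnyRef): Unit = idToApp(id) = info

  def removeApplication(id: String): Unit = idToApp.remove(id) match {
    case Some(_) => println("Removing app " + id)
    case None    => println(s"ERROR: asked to remove unknown/already-removed app $id")
  }
}
```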
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228966#comment-14228966 ] Josh Rosen commented on SPARK-4498: --- Here's an interesting pattern to grep for in all-master-logs-around-blip.txt: {{sparkdri...@spark-etl1.chi.shopify.com:52047}}. Note that this log is in reverse-chronological order. The earliest occurrence is in a DisassociatedEvent log message:
{code}
2014-11-19_18:48:31.34508 14/11/19 18:48:31 ERROR EndpointWriter: AssociationError [akka.tcp://sparkmas...@dn05.chi.shopify.com:7077] -> [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047]: Error [Shut down address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047] [
2014-11-19_18:48:31.34510 akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047
2014-11-19_18:48:31.34511 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
2014-11-19_18:48:31.34512 ]
2014-11-19_18:48:31.34521 14/11/19 18:48:31 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.16.126.88%3A48040-1355#-59270061] was not delivered. [2859] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-11-19_18:48:31.34603 14/11/19 18:48:31 INFO Master: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047 got disassociated, removing it.
2014-11-19_18:48:31.20255 14/11/19 18:48:31 INFO Master: Removing executor app-20141119184815-1316/7 because it is EXITED
{code}
Even though INFO-level logging is enabled, there's no "INFO Master: Removing app ..." message near this event. 
The entire log contains many repetitions of this same DisassociatedEvent log. The same log also contains many executors that launch and immediately fail:
{code}
2014-11-19_18:52:51.84000 14/11/19 18:52:51 INFO Master: Launching executor app-20141119184815-1313/75 on worker worker-20141118172622-dn19.chi.shopify.com-38498
2014-11-19_18:52:51.83981 14/11/19 18:52:51 INFO Master: Removing executor app-20141119184815-1313/67 because it is EXITED
{code}
I couldn't find a {{removing app app-20141119184815-1313}} event. Another interesting thing: even though it looks like this log contains information for 39 drivers, there are 100 disassociated events:
{code}
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | uniq | wc -l
39
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | wc -l
100
{code}
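The two shell pipelines above can be replicated in a few lines of Scala; the regex is an approximation of the grep pattern:

```scala
// Extract the driver port from each "... got disassociated" line, then
// compare total events against distinct drivers (100 vs. 39 in the
// attached log): more disassociations than drivers means some drivers
// disassociated more than once.
object DisassociationStats {
  private val Disassociated = """:(\d{5}) got disassociated""".r

  /** Returns (total disassociation events, distinct driver ports). */
  def count(logLines: Seq[String]): (Int, Int) = {
    val ports = logLines.flatMap(l => Disassociated.findFirstMatchIn(l).map(_.group(1)))
    (ports.size, ports.distinct.size)
  }
}
```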
[jira] [Resolved] (SPARK-4057) Use -agentlib instead of -Xdebug in sbt-launch-lib.bash for debugging
[ https://issues.apache.org/jira/browse/SPARK-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4057. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Kousuke Saruta Use -agentlib instead of -Xdebug in sbt-launch-lib.bash for debugging -- Key: SPARK-4057 URL: https://issues.apache.org/jira/browse/SPARK-4057 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Trivial Fix For: 1.3.0 In sbt-launch-lib.bash, the -Xdebug option is used for debugging. We should use the -agentlib option for Java 6+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
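For context, the old and new styles of enabling the JDWP debug agent look like this; the transport/port values shown are conventional defaults, not necessarily the exact text of the patch:

```
# Old style (deprecated since Java 5):
-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005

# Newer style for Java 6+:
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
```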
[jira] [Resolved] (SPARK-4505) Reduce the memory usage of CompactBuffer[T] when T is a primitive type
[ https://issues.apache.org/jira/browse/SPARK-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4505. Resolution: Fixed Fix Version/s: 1.3.0 Reduce the memory usage of CompactBuffer[T] when T is a primitive type -- Key: SPARK-4505 URL: https://issues.apache.org/jira/browse/SPARK-4505 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 If CompactBuffer has a ClassTag parameter, CompactBuffer can create primitive arrays for primitive types. This will reduce memory usage significantly for primitive types, at only a minor performance cost.
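The ClassTag idea can be sketched like this; a simplified illustration, not Spark's actual CompactBuffer:

```scala
import scala.reflect.ClassTag

// With a ClassTag in scope, new Array[T] allocates a primitive array
// (e.g. a raw double[]) when T is primitive, instead of an array of boxed
// objects, which is where the memory savings come from.
class CompactBufferSketch[T: ClassTag](initialCapacity: Int = 8) {
  private var elements = new Array[T](initialCapacity) // primitive-backed for primitive T
  private var curSize = 0

  def +=(value: T): this.type = {
    if (curSize == elements.length) {
      // Grow geometrically, copying the existing elements.
      val bigger = new Array[T](elements.length * 2)
      Array.copy(elements, 0, bigger, 0, curSize)
      elements = bigger
    }
    elements(curSize) = value
    curSize += 1
    this
  }

  def apply(i: Int): T = elements(i)
  def length: Int = curSize
}
```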
[jira] [Updated] (SPARK-4505) Reduce the memory usage of CompactBuffer[T] when T is a primitive type
[ https://issues.apache.org/jira/browse/SPARK-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4505: --- Assignee: Shixiong Zhu
[jira] [Updated] (SPARK-4628) Put external projects and examples behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4628: --- Summary: Put external projects and examples behind a build flag (was: Put all external projects behind a build flag) Put external projects and examples behind a build flag -- Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in Maven Central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple: have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl, which was disabled by default due to a license issue.
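Following the -Pkinesis-asl pattern, the opt-in profile in the root pom.xml might look roughly like this; the profile body and module list are illustrative assumptions, not the actual change:

```xml
<!-- Hypothetical sketch: modules only built when -Pexternal-projects is passed -->
<profile>
  <id>external-projects</id>
  <modules>
    <module>external/mqtt</module>
    <module>external/twitter</module>
    <module>external/zeromq</module>
    <module>examples</module>
  </modules>
</profile>
```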
[jira] [Created] (SPARK-4656) Typo in Programming Guide markdown
Kai Sasaki created SPARK-4656: - Summary: Typo in Programming Guide markdown Key: SPARK-4656 URL: https://issues.apache.org/jira/browse/SPARK-4656 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Trivial Grammatical error in Programming Guide document
[jira] [Commented] (SPARK-4656) Typo in Programming Guide markdown
[ https://issues.apache.org/jira/browse/SPARK-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228999#comment-14228999 ] Apache Spark commented on SPARK-4656: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3412
[jira] [Commented] (SPARK-4656) Typo in Programming Guide markdown
[ https://issues.apache.org/jira/browse/SPARK-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228998#comment-14228998 ] Kai Sasaki commented on SPARK-4656: --- Created the patch. Please review it. https://github.com/apache/spark/pull/3412
[jira] [Created] (SPARK-4657) RuntimeException: Unsupported datatype DecimalType()
pengyanhong created SPARK-4657: -- Summary: RuntimeException: Unsupported datatype DecimalType() Key: SPARK-4657 URL: https://issues.apache.org/jira/browse/SPARK-4657 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: pengyanhong java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) at com.jd.jddp.spark.hive.Cache.main(Cache.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4657) RuntimeException: Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-4657: --- Description: Executing a query against a Hive table that contains a decimal field fails with the error below: {quote} java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) at com.jd.jddp.spark.hive.Cache.main(Cache.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) {quote}
[jira] [Updated] (SPARK-4657) RuntimeException: Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-4657: --- Description: Executing a query against a Hive table that contains a decimal field, then saving the result to Tachyon as a Parquet file, fails with the error below: {quote} java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) at com.jd.jddp.spark.hive.Cache.main(Cache.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) {quote}
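For context on the error above: it is raised when the Parquet schema converter reaches a type it has no mapping for. Below is a tiny, hypothetical Python model of that failure shape (the map and function names are illustrative, not Spark's actual implementation); a common workaround at the time was to cast decimal columns to double or string in the query before saving as Parquet.

```python
# Hypothetical, simplified model of how a schema converter fails on an
# unmapped type, mirroring the shape of the error in this report.
# Names are illustrative, not Spark's actual implementation.
PRIMITIVE_MAP = {
    "IntegerType": "INT32",
    "LongType": "INT64",
    "StringType": "BINARY",
    "DoubleType": "DOUBLE",
}

def from_data_type(dtype):
    """Map a logical type name to a Parquet primitive, or fail loudly."""
    try:
        return PRIMITIVE_MAP[dtype]
    except KeyError:
        # This is the kind of path DecimalType() hits in the trace above.
        raise RuntimeError("Unsupported datatype %s()" % dtype)
```

In this model, any type outside the map (here, DecimalType) aborts the whole write, which is why casting the offending column up front sidesteps the error.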
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229001#comment-14229001 ] Patrick Wendell commented on SPARK-4630: Hey Kos - before starting to work on the design for this feature, could you try to quantify how important this actually is for performance? I.e. give some examples at scale in some benchmarks or user workloads? Spark in general is much less sensitive to the number of partitions than other frameworks since the overhead of launching individual tasks is very small. For this reason in the past we specifically decided not to introduce this complexity into Spark. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks: larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, the user pays a disproportionate cost in scheduling overhead. If partitions are too large, task execution slows down due to GC pressure and spilling to disk. To increase job performance, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are:
- incoming partition sizes from the previous stage
- the number of available executors
- available memory per executor (taking into account spark.shuffle.memoryFraction)
Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned off/on with a configuration option.
To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work.
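The heuristic sketched in this proposal can be made concrete. A minimal, hypothetical sketch of how a scheduler might pick a task count from the previous stage's output size; all names, the 128 MB target, and the defaults are illustrative assumptions, not Spark's actual logic:

```python
def choose_num_partitions(total_bytes, num_executors, mem_per_executor_bytes,
                          shuffle_memory_fraction=0.2,
                          target_partition_bytes=128 * 1024 * 1024):
    """Pick a task count so each partition lands near a target size,
    with at least one task per executor. Purely illustrative."""
    # Memory one task can use for shuffle data (hypothetical model of
    # spark.shuffle.memoryFraction).
    usable = mem_per_executor_bytes * shuffle_memory_fraction
    # Aim for the target size, but never exceed what fits in memory.
    per_partition = int(min(target_partition_bytes, usable))
    # Ceiling division: enough partitions to cover all the data.
    by_size = max(1, -(-total_bytes // per_partition))
    # At least one task per executor so the whole cluster stays busy.
    return max(by_size, num_executors)
```

For example, 1280 MB of shuffle output on 4 executors with 4 GB each would yield 10 tasks of ~128 MB, while a tiny input would still get one task per executor.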
[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229001#comment-14229001 ] Patrick Wendell edited comment on SPARK-4630 at 11/30/14 3:29 AM: -- Hey Kos - before starting to work on the design for this feature, could you try to quantify how important this actually is for performance? I.e. give some examples at scale in some benchmarks or user workloads? Spark in general is much less sensitive to the number of partitions than other frameworks since the overhead of launching individual tasks is very small. For this reason in the past we specifically decided not to introduce too much complexity into Spark for this, but we did add some heuristics over time. It seems like the proposal here is to extend the heuristics a bit, so some simpler extensions might make sense. was (Author: pwendell): Hey Kos - before starting to work on the design for this feature, could you try to quantify how important this actually is for performance? I.e. give some examples at scale in some benchmarks or user workloads? Spark in general is much less sensitive to the number of partitions than other frameworks since the overhead of launching individual tasks is very small. For this reason in the past we specifically decided not to introduce this complexity into Spark. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis
[jira] [Created] (SPARK-4658) Code documentation issue in DDL of datasource
Ravindra Pesala created SPARK-4658: -- Summary: Code documentation issue in DDL of datasource Key: SPARK-4658 URL: https://issues.apache.org/jira/browse/SPARK-4658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Ravindra Pesala Priority: Minor The CREATE TABLE syntax for data sources is documented incorrectly in ddl.scala: {code} /** * CREATE FOREIGN TEMPORARY TABLE avroTable * USING org.apache.spark.sql.avro * OPTIONS (path ../hive/src/test/resources/data/files/episodes.avro) */ {code} The correct syntax is: {code} /** * CREATE TEMPORARY TABLE avroTable * USING org.apache.spark.sql.avro * OPTIONS (path ../hive/src/test/resources/data/files/episodes.avro) */ {code} Wrong syntax is likewise documented in newParquet.scala: {code} `CREATE TABLE ... USING org.apache.spark.sql.parquet`. {code} The correct syntax is: {code} `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.parquet`. {code}
[jira] [Commented] (SPARK-4658) Code documentation issue in DDL of datasource
[ https://issues.apache.org/jira/browse/SPARK-4658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229005#comment-14229005 ] Apache Spark commented on SPARK-4658: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/3516 Code documentation issue in DDL of datasource - Key: SPARK-4658 URL: https://issues.apache.org/jira/browse/SPARK-4658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Ravindra Pesala Priority: Minor
[jira] [Resolved] (SPARK-4507) PR merge script should support closing multiple JIRA tickets
[ https://issues.apache.org/jira/browse/SPARK-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4507. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Takayuki Hasegawa PR merge script should support closing multiple JIRA tickets Key: SPARK-4507 URL: https://issues.apache.org/jira/browse/SPARK-4507 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Josh Rosen Assignee: Takayuki Hasegawa Priority: Minor Labels: starter Fix For: 1.3.0 For pull requests that reference multiple JIRAs in their titles, it would be helpful if the PR merge script offered to close all of them.
[jira] [Resolved] (SPARK-4543) Javadoc failure for network-common causes publish-local to fail
[ https://issues.apache.org/jira/browse/SPARK-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4543. Resolution: Duplicate This turned out to be an instance of SPARK-4193. Javadoc failure for network-common causes publish-local to fail --- Key: SPARK-4543 URL: https://issues.apache.org/jira/browse/SPARK-4543 Project: Spark Issue Type: Bug Components: Build, Documentation, Spark Core Reporter: Pedro Rodriguez Priority: Blocker Javadoc for network-common fails. This causes sbt publish-local to fail and not deliver the spark-network-common jar/module to the local publish location. This in turn causes applications assembled against locally linked Spark to fail to build. Steps:
1. Check out the master branch
2. sbt publish-local
3. Note the javadoc errors
4. Navigate to ~/.ivy2/local/org.apache.spark or equivalent
5. spark-network-common_2.10 should be missing
6. Building an application with sbt assembly against locally linked Spark then fails, since network-common is missing.
I confirmed the problem is the javadoc compilation failure by fixing all javadoc errors with placeholder TODOs. Confirmed the fix by running the above steps, which then work correctly. Pull Request: https://github.com/apache/spark/pull/3405
[jira] [Issue Comment Deleted] (SPARK-4101) [MLLIB] Improve API in Word2Vec model
[ https://issues.apache.org/jira/browse/SPARK-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-4101: Comment: was deleted (was: If no-one is working on this I would be happy to knock this out. Thanks! ) [MLLIB] Improve API in Word2Vec model - Key: SPARK-4101 URL: https://issues.apache.org/jira/browse/SPARK-4101 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Peter Rudenko Priority: Minor 1) It would be nice to be able to retrieve the underlying model map, to be able to work with it afterwards (make an RDD, persist/load, online train, etc.). 2) Add an analogyWords(w1: String, w2: String, target: String, num: Int) method, which returns num words that relate to target as w1 relates to w2.
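The proposed analogyWords method is typically implemented with vector arithmetic: the answer vector is vec(w2) - vec(w1) + vec(target), and the result is its nearest neighbours by cosine similarity. A minimal, self-contained sketch over a plain dict of word vectors (purely illustrative; not the MLlib API):

```python
import math

def analogy_words(model, w1, w2, target, num):
    """Return `num` words relating to `target` as `w1` relates to `w2`:
    nearest neighbours of vec(w2) - vec(w1) + vec(target) by cosine
    similarity. `model` is a plain dict of word -> list[float]."""
    query = [b - a + t for a, b, t in zip(model[w1], model[w2], model[target])]
    qnorm = math.sqrt(sum(x * x for x in query))

    def cosine(v):
        vnorm = math.sqrt(sum(x * x for x in v))
        # Guard against zero vectors to avoid division by zero.
        return sum(p * q for p, q in zip(v, query)) / (vnorm * qnorm or 1.0)

    # Exclude the query words themselves, rank the rest by similarity.
    candidates = [(w, cosine(v)) for w, v in model.items()
                  if w not in (w1, w2, target)]
    candidates.sort(key=lambda wc: -wc[1])
    return [w for w, _ in candidates[:num]]
```

With toy vectors where king = man + royalty, analogy_words(model, "man", "king", "woman", 1) recovers the word whose vector is woman + royalty, i.e. the classic king - man + woman analogy.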
[jira] [Created] (SPARK-4659) Implement K-core decomposition algorithm
Xiaoming Li created SPARK-4659: -- Summary: Implement K-core decomposition algorithm Key: SPARK-4659 URL: https://issues.apache.org/jira/browse/SPARK-4659 Project: Spark Issue Type: New Feature Components: Examples, GraphX Affects Versions: 1.0.2 Reporter: Xiaoming Li I found that GraphX has no algorithm for *K-core/K-shell decomposition* yet. Based on the paper [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6189336], I have implemented the k-core decomposition algorithm, which can be used for analyzing large-scale complex networks. Compared with the traditional K-core decomposition algorithm, it propagates K-shell values via messages instead of restructuring the topology iteratively, which reduces the number of iterations enormously.
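For comparison with the message-passing variant described above, the traditional k-core algorithm repeatedly peels off minimum-degree vertices; a vertex's core number is the degree threshold at which it is removed. A minimal sketch of that reference algorithm (illustrative only, not the proposed GraphX implementation):

```python
from collections import deque

def core_numbers(adj):
    """Classic sequential k-core peeling. `adj` maps each node to a set of
    neighbours; returns a dict node -> core number."""
    degree = {v: len(ns) for v, ns in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Collect every vertex whose degree has dropped to <= k.
        peel = deque(v for v in remaining if degree[v] <= k)
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.popleft()
            if v not in remaining:
                continue  # already peeled via a duplicate queue entry
            core[v] = k
            remaining.discard(v)
            # Removing v lowers its neighbours' degrees; they may now peel too.
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        peel.append(u)
    return core
```

For a triangle with one pendant vertex attached, the triangle's vertices get core number 2 and the pendant gets 1; the message-passing formulation in the ticket computes the same values without physically removing vertices.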