[jira] [Created] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.

2015-04-07 Thread Twinkle Sachdeva (JIRA)
Twinkle Sachdeva created SPARK-6735:
---

 Summary: Provide options to make maximum executor failure count ( 
which kills the application ) relative to a window duration or disable it.
 Key: SPARK-6735
 URL: https://issues.apache.org/jira/browse/SPARK-6735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, YARN
Affects Versions: 1.3.0, 1.2.1, 1.2.0
Reporter: Twinkle Sachdeva


Currently there is a setting (spark.yarn.max.executor.failures) which specifies the 
maximum number of executor failures after which the application fails.
For long-running applications, users may want the application not to be killed at 
all, or may want such a limit to apply relative to a window duration. This 
improvement is to provide options to make the maximum executor failure count 
(which kills the application) relative to a window duration, or to disable it.
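For illustration, a minimal sketch of how such settings could be supplied. Only spark.yarn.max.executor.failures is an existing Spark setting; the window/disable keys below are hypothetical names sketching the proposal, not real configuration options:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Existing behaviour: fail the application after 10 executor failures in total.
  .set("spark.yarn.max.executor.failures", "10")
  // Hypothetical options illustrating the proposal (not real Spark settings):
  // either count failures only within a sliding window ...
  .set("spark.yarn.max.executor.failures.window", "1h")
  // ... or disable the limit entirely for long-running applications.
  .set("spark.yarn.max.executor.failures.enabled", "false")
{code}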






[jira] [Commented] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream

2015-04-07 Thread Alberto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482752#comment-14482752
 ] 

Alberto commented on SPARK-6431:


I think you're right, Cody. I've been having a look at my code and I've found 
out that I'm not creating the topic before creating the DirectStream. If you 
are interested, here is the test I am running: 
https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala

I completely agree with you: if the topic doesn't exist, it should return an 
error and not a misleading empty set.



 Couldn't find leader offsets exception when creating KafkaDirectStream
 --

 Key: SPARK-6431
 URL: https://issues.apache.org/jira/browse/SPARK-6431
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Alberto

 When I try to create an InputDStream using the createDirectStream method of 
 the KafkaUtils class and the Kafka topic does not have any messages yet, I am 
 getting the following error:
 org.apache.spark.SparkException: Couldn't find leader offsets for Set()
 org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't 
 find leader offsets for Set()
   at 
 org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
 If I put a message in the topic before creating the DirectStream, everything 
 works fine.
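For reference, a minimal sketch of the call in question, assuming a local broker at localhost:9092 and a topic named events; as discussed in the comments, the topic has to exist before the stream is created, otherwise the exception above is thrown:
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// The topic must already exist (an empty topic is fine); a missing topic currently
// surfaces as "Couldn't find leader offsets for Set()".
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
{code}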






[jira] [Comment Edited] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream

2015-04-07 Thread Alberto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482752#comment-14482752
 ] 

Alberto edited comment on SPARK-6431 at 4/7/15 7:43 AM:


You're absolutely right, Cody. I've been having a look at my code and I've found 
out that I'm not creating the topic before creating the DirectStream. If you 
are interested, here is the test I am running: 
https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala

I completely agree with you: if the topic doesn't exist, it should return an 
error and not a misleading empty set.




was (Author: ardlema):
I think you're right, Cody. I've been having a look at my code and I've found 
out that I'm not creating the topic before creating the DirectStream. If you 
are interested, here is the test I am running: 
https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala

I completely agree with you: if the topic doesn't exist, it should return an 
error and not a misleading empty set.



 Couldn't find leader offsets exception when creating KafkaDirectStream
 --

 Key: SPARK-6431
 URL: https://issues.apache.org/jira/browse/SPARK-6431
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Alberto

 When I try to create an InputDStream using the createDirectStream method of 
 the KafkaUtils class and the Kafka topic does not have any messages yet, I am 
 getting the following error:
 org.apache.spark.SparkException: Couldn't find leader offsets for Set()
 org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't 
 find leader offsets for Set()
   at 
 org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
 If I put a message in the topic before creating the DirectStream, everything 
 works fine.






[jira] [Updated] (SPARK-6721) IllegalStateException when connecting to MongoDB using spark-submit

2015-04-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Rodríguez Trejo updated SPARK-6721:

Summary: IllegalStateException when connecting to MongoDB using 
spark-submit  (was: IllegalStateException)

 IllegalStateException when connecting to MongoDB using spark-submit
 ---

 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
  Labels: MongoDB, java.lang.IllegalStateexception, 
 saveAsNewAPIHadoopFile

 I get the following exception when using saveAsNewAPIHadoopFile:
 {code}
 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
 10.0.2.15): java.lang.IllegalStateException: open
 at org.bson.util.Assertions.isTrue(Assertions.java:36)
 at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
 at com.mongodb.DBCollection.insert(DBCollection.java:161)
 at com.mongodb.DBCollection.insert(DBCollection.java:107)
 at com.mongodb.DBCollection.save(DBCollection.java:1049)
 at com.mongodb.DBCollection.save(DBCollection.java:1014)
 at 
 com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 Before Spark 1.3.0 this would result in the application crashing, but now the 
 data just remains unprocessed.
 There is no close() call anywhere in the code.
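For context, a minimal sketch of the kind of write path involved, assuming the mongo-hadoop connector is on the classpath and a local mongod is running; database and collection names are placeholders:
{code}
import com.mongodb.hadoop.MongoOutputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.{BSONObject, BasicBSONObject}

val sc = new SparkContext(new SparkConf().setAppName("mongo-save"))

// mongo-hadoop takes the target collection from this key.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1:27017/testdb.testcoll")

val docs = sc.parallelize(1 to 3).map { i =>
  val doc: BSONObject = new BasicBSONObject()
  doc.put("value", Integer.valueOf(i))
  (null, doc)
}

// The path argument is not used by MongoOutputFormat but must be supplied.
docs.saveAsNewAPIHadoopFile(
  "file:///unused",
  classOf[Object],
  classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]],
  outputConfig)
{code}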






[jira] [Created] (SPARK-6736) Example of Graph#aggregateMessages has error

2015-04-07 Thread Sasaki Toru (JIRA)
Sasaki Toru created SPARK-6736:
--

 Summary: Example of Graph#aggregateMessages has error
 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


The example of Graph#aggregateMessages has an error.
Since aggregateMessages is a method of Graph, it should be written as 
rawGraph.aggregateMessages
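For illustration, a minimal sketch of calling aggregateMessages on a Graph instance; the graph construction here is only an assumption for the example:
{code}
import org.apache.spark.graphx.{Graph, GraphLoader}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("aggregate-messages"))

// Any Graph instance works; an edge list file is used here just to have something concrete.
val rawGraph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

// aggregateMessages is invoked on the graph instance (rawGraph.aggregateMessages):
// this example counts the in-degree of every vertex by sending 1 along each edge.
val inDegrees = rawGraph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),  // send a message to each edge's destination vertex
  _ + _)                    // merge messages arriving at the same vertex
{code}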






[jira] [Commented] (SPARK-6708) Using Hive UDTF may throw ClassNotFoundException

2015-04-07 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482805#comment-14482805
 ] 

Cheng Lian commented on SPARK-6708:
---

Thanks for pointing this out. I followed SPARK-4854 and found that this issue 
duplicates SPARK-4811.

 Using Hive UDTF may throw ClassNotFoundException
 

 Key: SPARK-6708
 URL: https://issues.apache.org/jira/browse/SPARK-6708
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian

 Spark shell session for reproducing this issue:
 {code}
 import sqlContext._
 sql("create table t1 (str string)")
 sql("select v.va from t1 lateral view json_tuple(str, 'a') v as va").queryExecution.analyzed
 {code}
 Exception thrown:
 {noformat}
 java.lang.ClassNotFoundException: json_tuple
 at 
 scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at 
 org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:148)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:280)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:280)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:285)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:285)
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:291)
 at 
 org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
 at scala.Option.map(Option.scala:145)
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:60)
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:70)
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
 at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:117)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2$$anonfun$11.apply(Analyzer.scala:292)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2$$anonfun$11.apply(Analyzer.scala:292)
 at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:292)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:284)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:252)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:252)
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:251)
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 

[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL

2015-04-07 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4811:
--
Affects Version/s: 1.2.1
   1.3.0

 Custom UDTFs not working in Spark SQL
 -

 Key: SPARK-4811
 URL: https://issues.apache.org/jira/browse/SPARK-4811
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.1, 1.3.0
Reporter: Saurabh Santhosh
Priority: Critical

 I am using the Thrift server interface to Spark SQL and using beeline to 
 connect to it.
 I tried Spark SQL versions 1.1.0 and 1.1.1 and both are throwing the 
 following exception when using any custom UDTF.
 These are the steps I did:
 *Created a UDTF 'com.x.y.xxx'.*
 Registered the UDTF using the following query: 
 *create temporary function xxx as 'com.x.y.xxx'*
 The registration went through without any errors. But when I tried executing 
 the UDTF I got the following error:
 *java.lang.ClassNotFoundException: xxx*
 The funny thing is that it is trying to load the function name instead of the 
 function class. The exception is at *line no: 81 in hiveUdfs.scala*
 I have been at it for quite a long time.






[jira] [Commented] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482813#comment-14482813
 ] 

Apache Spark commented on SPARK-6736:
-

User 'sasakitoa' has created a pull request for this issue:
https://github.com/apache/spark/pull/5388

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Updated] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory

2015-04-07 Thread Tao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li updated SPARK-6737:
--
Description: 
I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
memory after running for 7 days.

I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
class causes the OOM. 
authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
stageEnd which removes the stageId from authorizedCommittersByStage, but stageEnd 
is never called by the DAGScheduler. As a result, the stage info in 
authorizedCommittersByStage is never cleaned up, which causes the OOM.

It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
also appear in other Spark components or other Spark versions.

  was:
I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
memory after running for 7 weeks.

I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
class causes the OOM. 
authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
stageEnd which removes the stageId from authorizedCommittersByStage, but stageEnd 
is never called by the DAGScheduler. As a result, the stage info in 
authorizedCommittersByStage is never cleaned up, which causes the OOM.

It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
also appear in other Spark components or other Spark versions.


 OutputCommitCoordinator.authorizedCommittersByStage map out of memory
 -

 Key: SPARK-6737
 URL: https://issues.apache.org/jira/browse/SPARK-6737
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Affects Versions: 1.3.0
 Environment: spark 1.3.1
Reporter: Tao Li
Priority: Critical
  Labels: Bug, Core, DAGScheduler, OOM, Streaming

 I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
 memory after running for 7 days. 
 I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
 class causes the OOM. 
 authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
 Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a 
 method stageEnd which removes the stageId from authorizedCommittersByStage, 
 but stageEnd is never called by the DAGScheduler. As a result, the stage info 
 in authorizedCommittersByStage is never cleaned up, which causes the OOM.
 It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
 also appear in other Spark components or other Spark versions.






[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482749#comment-14482749
 ] 

Paweł Kopiczko commented on SPARK-6514:
---

I named the param `regionName` after the one in the 
KinesisClientLibConfiguration. I thought it would be obvious to anyone 
familiar with the KCL consumer API, but dynamoRegion is also good. We can extract 
the region from streamURL, but maybe someone would want to set a different region 
than the stream location. Why do you think that exposing the implementation is not 
good? Don't you think that everything that is configurable in the KCL client we 
use should also be configurable when creating the DStream?

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 Context: I started the original Kinesis impl with KCL 1.0 (not supported), 
 then finished on KCL 1.1 (supported) without realizing that it's supported.
 Also, we should upgrade to the latest Kinesis Client Library (KCL), which is 
 currently v1.2, I believe.






[jira] [Commented] (SPARK-6721) IllegalStateException when connecting to MongoDB using spark-submit

2015-04-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482798#comment-14482798
 ] 

Luis Rodríguez Trejo commented on SPARK-6721:
-

[~sowen] thank you for your comments. I modified the name; I think it is more 
meaningful now.
Regarding your first comment, I actually had no idea which component caused the 
problem. I apologize. Should I close the issue?

 IllegalStateException when connecting to MongoDB using spark-submit
 ---

 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
  Labels: MongoDB, java.lang.IllegalStateexception, 
 saveAsNewAPIHadoopFile

 I get the following exception when using saveAsNewAPIHadoopFile:
 {code}
 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
 10.0.2.15): java.lang.IllegalStateException: open
 at org.bson.util.Assertions.isTrue(Assertions.java:36)
 at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
 at com.mongodb.DBCollection.insert(DBCollection.java:161)
 at com.mongodb.DBCollection.insert(DBCollection.java:107)
 at com.mongodb.DBCollection.save(DBCollection.java:1049)
 at com.mongodb.DBCollection.save(DBCollection.java:1014)
 at 
 com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 Before Spark 1.3.0 this would result in the application crashing, but now the 
 data just remains unprocessed.
 There is no close() call anywhere in the code.






[jira] [Resolved] (SPARK-4811) Custom UDTFs not working in Spark SQL

2015-04-07 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-4811.
---
Resolution: Duplicate

Although this ticket was opened earlier, I'm marking it as a duplicate of SPARK-6708 
because SPARK-6708 gives clear reproduction steps.

 Custom UDTFs not working in Spark SQL
 -

 Key: SPARK-4811
 URL: https://issues.apache.org/jira/browse/SPARK-4811
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.1, 1.3.0
Reporter: Saurabh Santhosh
Priority: Critical

 I am using the Thrift server interface to Spark SQL and using beeline to 
 connect to it.
 I tried Spark SQL versions 1.1.0 and 1.1.1 and both are throwing the 
 following exception when using any custom UDTF.
 These are the steps I did:
 *Created a UDTF 'com.x.y.xxx'.*
 Registered the UDTF using the following query: 
 *create temporary function xxx as 'com.x.y.xxx'*
 The registration went through without any errors. But when I tried executing 
 the UDTF I got the following error:
 *java.lang.ClassNotFoundException: xxx*
 The funny thing is that it is trying to load the function name instead of the 
 function class. The exception is at *line no: 81 in hiveUdfs.scala*
 I have been at it for quite a long time.






[jira] [Updated] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory

2015-04-07 Thread Tao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li updated SPARK-6737:
--
Description: 
I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
memory after running for 7 weeks.

I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
class causes the OOM. 
authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
stageEnd which removes the stageId from authorizedCommittersByStage, but stageEnd 
is never called by the DAGScheduler. As a result, the stage info in 
authorizedCommittersByStage is never cleaned up, which causes the OOM.

It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
also appear in other Spark components or other Spark versions.

  was:
I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
memory after running for 7 weeks. I found that the field 
authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. 
authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
stageEnd which removes the stageId from authorizedCommittersByStage, but stageEnd 
is never called by the DAGScheduler. As a result, the stage info in 
authorizedCommittersByStage is never cleaned up, which causes the OOM.

It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
also appear in other Spark components or other Spark versions.


 OutputCommitCoordinator.authorizedCommittersByStage map out of memory
 -

 Key: SPARK-6737
 URL: https://issues.apache.org/jira/browse/SPARK-6737
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Affects Versions: 1.3.0
 Environment: spark 1.3.1
Reporter: Tao Li
Priority: Critical
  Labels: Bug, Core, DAGScheduler, OOM, Streaming

 I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
 memory after running for 7 weeks. 
 I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
 class causes the OOM. 
 authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
 Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a 
 method stageEnd which removes the stageId from authorizedCommittersByStage, 
 but stageEnd is never called by the DAGScheduler. As a result, the stage info 
 in authorizedCommittersByStage is never cleaned up, which causes the OOM.
 It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
 also appear in other Spark components or other Spark versions.






[jira] [Created] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory

2015-04-07 Thread Tao Li (JIRA)
Tao Li created SPARK-6737:
-

 Summary: OutputCommitCoordinator.authorizedCommittersByStage map 
out of memory
 Key: SPARK-6737
 URL: https://issues.apache.org/jira/browse/SPARK-6737
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Affects Versions: 1.3.0
 Environment: spark 1.3.1
Reporter: Tao Li
Priority: Critical


I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of 
memory after running for 7 weeks. I found that the field 
authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. 
authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
stageEnd which removes the stageId from authorizedCommittersByStage, but stageEnd 
is never called by the DAGScheduler. As a result, the stage info in 
authorizedCommittersByStage is never cleaned up, which causes the OOM.

It happens in my Spark Streaming program (1.3.1); I am not sure whether it will 
also appear in other Spark components or other Spark versions.
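For illustration, a simplified sketch of the data structure being described (names and types are simplified here, not the actual Spark code): one entry is added per stage, and if the stageEnd cleanup hook is never invoked, the map grows for the lifetime of a long-running job.
{code}
import scala.collection.mutable

object AuthorizedCommitters {
  // stageId -> (partitionId -> task attempt that is allowed to commit)
  private val byStage = mutable.Map.empty[Int, mutable.Map[Int, Long]]

  def stageStart(stageId: Int): Unit =
    byStage(stageId) = mutable.Map.empty[Int, Long]

  // The cleanup hook: if this is never called (the reporter's claim about the
  // DAGScheduler), entries accumulate indefinitely and eventually cause an OOM
  // in a long-running streaming application.
  def stageEnd(stageId: Int): Unit =
    byStage.remove(stageId)
}
{code}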






[jira] [Commented] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-07 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482768#comment-14482768
 ] 

uncleGen commented on SPARK-6695:
-

 [~srowen] Thanks for your patience. Yeah, it is a good fix with the smallest 
number of changes. In my practical use, we usually need to create a big array, and 
sometimes the size of the array may exceed 2^32. So IMHO, we may provide a general 
external `Iterator` for the sake of usability and memory usage. That said, 
[PR-5364|https://github.com/apache/spark/pull/5364] is enough for that issue.

 Add an external iterator: a hadoop-like output collector
 

 Key: SPARK-6695
 URL: https://issues.apache.org/jira/browse/SPARK-6695
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: uncleGen

 In practical use, we often need to create a big iterator, meaning one that is too 
 big in `memory usage` or too long in `array size`. On the one hand, it leads 
 to too much memory consumption. On the other hand, one `Array` may not hold 
 all the elements, as Java array indices are of type 'int' (4 bytes or 32 
 bits). So, IMHO, we could provide a `collector`, which has a buffer (100MB or 
 some other size) and can spill data to disk. The use case may look like:
 {code: borderStyle=solid}
rdd.mapPartitions { it =>
   ...
   val collector = new ExternalCollector()
   collector.collect(a)
   ...
   collector.iterator
   }

 {code}
 I have done some related work, and I need your opinions. Thanks!
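For discussion, a purely illustrative sketch of such a collector (the class name, threshold and Java-serialization spill format are all assumptions, not a proposed Spark API): records are buffered in memory and spilled to a temporary file once a threshold is reached; iteration replays the spilled data followed by the in-memory tail.
{code}
import java.io._
import scala.collection.mutable.ArrayBuffer

class ExternalCollector[T <: java.io.Serializable](spillThreshold: Int = 100000) {
  private val buffer = new ArrayBuffer[T]()
  private val spillFiles = new ArrayBuffer[File]()

  def collect(value: T): Unit = {
    buffer += value
    if (buffer.size >= spillThreshold) spill()
  }

  // Write the current buffer to a temp file and clear it.
  private def spill(): Unit = {
    val file = File.createTempFile("external-collector", ".bin")
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try buffer.foreach(v => out.writeObject(v)) finally out.close()
    spillFiles += file
    buffer.clear()
  }

  // Replays spilled records from disk first, then whatever is still buffered in memory.
  def iterator: Iterator[T] = {
    val spilled = spillFiles.iterator.flatMap { file =>
      val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)))
      Iterator.continually {
        try Some(in.readObject().asInstanceOf[T])
        catch { case _: EOFException => in.close(); None }
      }.takeWhile(_.isDefined).flatten
    }
    spilled ++ buffer.iterator
  }
}
{code}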






[jira] [Commented] (SPARK-4854) Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2015-04-07 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482802#comment-14482802
 ] 

Cheng Lian commented on SPARK-4854:
---

[~wanshenghua] Is the XXX in ClassNotFoundException a class name or a UDTF 
function name? I'm trying to figure out whether this is related to SPARK-6708. 
Thanks!

 Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI
 -

 Key: SPARK-4854
 URL: https://issues.apache.org/jira/browse/SPARK-4854
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1
Reporter: Shenghua Wan

 Hello, 
 I ran into a problem when using the Spark SQL CLI: a custom UDTF with lateral 
 view throws a ClassNotFound exception. I did a couple of experiments in the 
 same environment (Spark versions 1.1.0, 1.1.1): 
 select + same custom UDTF (Passed) 
 select + lateral view + custom UDTF (ClassNotFoundException) 
 select + lateral view + built-in UDTF (Passed) 
 I have done some googling these days and found one related Spark issue ticket, 
 https://issues.apache.org/jira/browse/SPARK-4811
 which is about custom UDTFs not working in Spark SQL. 
 It would be helpful to put the actual code here to reproduce the problem; 
 however, corporate regulations might prohibit this, so sorry about that. 
 Directly using explode's source code in a jar will help reproduce it anyway. 
 Here is a portion of the stack trace from the exception, just in case: 
 java.lang.ClassNotFoundException: XXX 
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) 
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355) 
 at java.security.AccessController.doPrivileged(Native Method) 
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354) 
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425) 
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358) 
 at 
 org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:81)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.createFunction(hiveUdfs.scala:247) 
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:254)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:254) 
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors$lzycompute(hiveUdfs.scala:261)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors(hiveUdfs.scala:260)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:265)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:265) 
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:269) 
 at 
 org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
  
 at scala.Option.map(Option.scala:145) 
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:50)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:60)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79)
  
 at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  
 at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  
 at scala.collection.immutable.List.foreach(List.scala:318) 
 at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) 
 at 
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) 
 the rest is omitted. 
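To make the failing shape concrete, here is a rough sketch of the query pattern described above, assuming a HiveContext named hiveContext and a hypothetical UDTF class in place of the redacted one (the plain SELECT form passes; the LATERAL VIEW form is the one that throws):
{code}
// Hypothetical UDTF class name standing in for the redacted one.
hiveContext.sql("CREATE TEMPORARY FUNCTION my_explode AS 'com.example.MyExplodeUDTF'")

// Passes: plain SELECT with the custom UDTF.
hiveContext.sql("SELECT my_explode(items) FROM src").collect()

// Throws ClassNotFoundException in 1.1.x: LATERAL VIEW with the same custom UDTF.
hiveContext.sql("SELECT t.item FROM src LATERAL VIEW my_explode(items) t AS item").collect()
{code}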






[jira] [Assigned] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6736:
---

Assignee: (was: Apache Spark)

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Assigned] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6736:
---

Assignee: Apache Spark

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Assignee: Apache Spark
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Updated] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-6736:
---
Summary: [GraphX]Example of Graph#aggregateMessages has error  (was: 
Example of Graph#aggregateMessages has error)

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Commented] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482803#comment-14482803
 ] 

Apache Spark commented on SPARK-6736:
-

User 'sasakitoa' has created a pull request for this issue:
https://github.com/apache/spark/pull/5387

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Updated] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-6736:
---
Component/s: Documentation

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Resolved] (SPARK-6716) Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver'

2015-04-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6716.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5372
[https://github.com/apache/spark/pull/5372]

 Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver'
 -

 Key: SPARK-6716
 URL: https://issues.apache.org/jira/browse/SPARK-6716
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.4.0


 Currently, the driver's executorId is set to {{<driver>}}.  This choice of ID 
 was present in older Spark versions, but it has started to cause problems now 
 that executorIds are used in more contexts, such as Ganglia metric names or 
 driver thread-dump links in the web UI.  The angle brackets must be escaped when 
 embedding this ID in XML or as part of URLs, and this has led to multiple 
 problems:
 - https://issues.apache.org/jira/browse/SPARK-6484
 - https://issues.apache.org/jira/browse/SPARK-4313
 The simplest solution seems to be to change this id to something that does 
 not contain any special characters, such as {{driver}}. 
 I'm not sure whether we can perform this change in a patch release, since 
 this ID may be considered a stable API by metrics users, but it's probably 
 okay to do this in a major release as long as we document it in the release 
 notes.
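As a tiny illustration of the escaping issue (not code from Spark itself): the old identifier cannot be embedded verbatim in a URL, while the proposed plain string can:
{code}
import java.net.URLEncoder

URLEncoder.encode("<driver>", "UTF-8")  // "%3Cdriver%3E" -- angle brackets need escaping
URLEncoder.encode("driver", "UTF-8")    // "driver" -- passes through unchanged
{code}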






[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482685#comment-14482685
 ] 

Apache Spark commented on SPARK-6691:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5385

 Abstract and add a dynamic RateLimiter for Spark Streaming
 --

 Key: SPARK-6691
 URL: https://issues.apache.org/jira/browse/SPARK-6691
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Flow control (or rate control) for input data is very important in a streaming 
 system, especially for Spark Streaming to stay stable and up to date. An 
 unexpected flood of incoming data, or an ingestion rate beyond the computation 
 power of the cluster, will make the system unstable and increase the delay. 
 Because of Spark Streaming's job generation and processing pattern, this delay 
 accumulates and introduces unacceptable exceptions.
 
 Currently, in Spark Streaming's receiver-based input stream, there is a 
 RateLimiter in BlockGenerator which controls the ingestion rate of input 
 data, but the current implementation has several limitations:
 # The max ingestion rate is set by the user through configuration beforehand; 
 the user may lack the experience to choose an appropriate value before the 
 application is running.
 # This configuration is fixed for the lifetime of the application, which 
 means you need to consider the worst-case scenario to set a reasonable 
 configuration.
 # Input streams like DirectKafkaInputStream need to maintain another solution 
 to achieve the same functionality.
 # The lack of slow-start control makes the whole system easily trapped in large 
 processing and scheduling delays at the very beginning.
 
 So here we propose a new dynamic RateLimiter, as well as a new interface for 
 the RateLimiter, to improve the whole system's stability. The targets are:
 * Dynamically adjust the ingestion rate according to the processing rate of 
 previously finished jobs.
 * Offer a uniform solution not only for receiver-based input streams, but 
 also for direct streams like DirectKafkaInputStream and new ones.
 * Slow-start the rate to control network congestion when the job is started.
 * A pluggable framework to make extensions easier to maintain.
 
 Here is the design doc 
 (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
  and working branch 
 (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
 Any comment would be greatly appreciated.
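Purely as a discussion aid (not the interface from the design doc), a rough sketch of what a pluggable, feedback-driven rate limiter could look like:
{code}
// Illustrative only: names and the adjustment policy are assumptions, not the proposal itself.
trait StreamRateLimiter {
  /** Maximum records per second currently allowed. */
  def currentRate: Long
  /** Feed back how the last completed batch went, so the rate can adapt. */
  def batchCompleted(processedRecords: Long, processingTimeMs: Long): Unit
}

class SlowStartRateLimiter(initialRate: Long, maxRate: Long) extends StreamRateLimiter {
  private var rate = initialRate

  override def currentRate: Long = rate

  override def batchCompleted(processedRecords: Long, processingTimeMs: Long): Unit = {
    // Rate the cluster actually sustained in the last batch.
    val achieved =
      if (processingTimeMs > 0) processedRecords * 1000 / processingTimeMs else rate
    // Slow start: at most double per batch, never beyond what was sustained or maxRate.
    rate = math.min(maxRate, math.max(1L, math.min(rate * 2, achieved)))
  }
}
{code}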






[jira] [Resolved] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py

2015-04-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6636.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.2

Issue resolved by pull request 5302
[https://github.com/apache/spark/pull/5302]

 Use public DNS hostname everywhere in spark_ec2.py
 --

 Key: SPARK-6636
 URL: https://issues.apache.org/jira/browse/SPARK-6636
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Matt Aasted
Priority: Minor
 Fix For: 1.3.2, 1.4.0


 The spark_ec2.py script uses public_dns_name everywhere in the script except 
 for testing ssh availability, which is done using the public ip address of 
 the instances. This breaks the script for users who are deploying the cluster 
 with a private-network-only security group. The fix is to use public_dns_name 
 in the remaining place.
 I am submitting a pull-request alongside this bug report.






[jira] [Updated] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py

2015-04-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6636:
--
Assignee: Matt Aasted

 Use public DNS hostname everywhere in spark_ec2.py
 --

 Key: SPARK-6636
 URL: https://issues.apache.org/jira/browse/SPARK-6636
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Matt Aasted
Assignee: Matt Aasted
Priority: Minor
 Fix For: 1.3.2, 1.4.0


 The spark_ec2.py script uses public_dns_name everywhere in the script except 
 for testing ssh availability, which is done using the public ip address of 
 the instances. This breaks the script for users who are deploying the cluster 
 with a private-network-only security group. The fix is to use public_dns_name 
 in the remaining place.
 I am submitting a pull-request alongside this bug report.






[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482711#comment-14482711
 ] 

Yu Ishikawa commented on SPARK-6682:


I think so. In conclusion, I agree with [~josephkb]'s proposal.
I've just announced this issue on the developers list, so someone else may 
join the discussion. However, I think we should reach a conclusion, using 
[~mengxr]'s opinion as a reference.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]






[jira] [Assigned] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6691:
---

Assignee: (was: Apache Spark)

 Abstract and add a dynamic RateLimiter for Spark Streaming
 --

 Key: SPARK-6691
 URL: https://issues.apache.org/jira/browse/SPARK-6691
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Flow control (or rate control) for input data is very important in a streaming 
 system, especially for Spark Streaming to stay stable and up to date. An 
 unexpected flood of incoming data, or an ingestion rate beyond the computation 
 power of the cluster, will make the system unstable and increase the delay. 
 Because of Spark Streaming's job generation and processing pattern, this delay 
 accumulates and introduces unacceptable exceptions.
 
 Currently, in Spark Streaming's receiver-based input stream, there is a 
 RateLimiter in BlockGenerator which controls the ingestion rate of input 
 data, but the current implementation has several limitations:
 # The max ingestion rate is set by the user through configuration beforehand; 
 the user may lack the experience to choose an appropriate value before the 
 application is running.
 # This configuration is fixed for the lifetime of the application, which 
 means you need to consider the worst-case scenario to set a reasonable 
 configuration.
 # Input streams like DirectKafkaInputStream need to maintain another solution 
 to achieve the same functionality.
 # The lack of slow-start control makes the whole system easily trapped in large 
 processing and scheduling delays at the very beginning.
 
 So here we propose a new dynamic RateLimiter, as well as a new interface for 
 the RateLimiter, to improve the whole system's stability. The targets are:
 * Dynamically adjust the ingestion rate according to the processing rate of 
 previously finished jobs.
 * Offer a uniform solution not only for receiver-based input streams, but 
 also for direct streams like DirectKafkaInputStream and new ones.
 * Slow-start the rate to control network congestion when the job is started.
 * A pluggable framework to make extensions easier to maintain.
 
 Here is the design doc 
 (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
  and working branch 
 (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
 Any comment would be greatly appreciated.






[jira] [Assigned] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6691:
---

Assignee: Apache Spark

 Abstract and add a dynamic RateLimiter for Spark Streaming
 --

 Key: SPARK-6691
 URL: https://issues.apache.org/jira/browse/SPARK-6691
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao
Assignee: Apache Spark

 Flow control (or rate control) for input data is very important in a streaming 
 system, especially for Spark Streaming to stay stable and up to date. An 
 unexpected flood of incoming data, or an ingestion rate beyond the computation 
 power of the cluster, will make the system unstable and increase the delay. 
 Because of Spark Streaming's job generation and processing pattern, this delay 
 accumulates and introduces unacceptable exceptions.
 
 Currently, in Spark Streaming's receiver-based input stream, there is a 
 RateLimiter in BlockGenerator which controls the ingestion rate of input 
 data, but the current implementation has several limitations:
 # The max ingestion rate is set by the user through configuration beforehand; 
 the user may lack the experience to choose an appropriate value before the 
 application is running.
 # This configuration is fixed for the lifetime of the application, which 
 means you need to consider the worst-case scenario to set a reasonable 
 configuration.
 # Input streams like DirectKafkaInputStream need to maintain another solution 
 to achieve the same functionality.
 # The lack of slow-start control makes the whole system easily trapped in large 
 processing and scheduling delays at the very beginning.
 
 So here we propose a new dynamic RateLimiter, as well as a new interface for 
 the RateLimiter, to improve the whole system's stability. The targets are:
 * Dynamically adjust the ingestion rate according to the processing rate of 
 previously finished jobs.
 * Offer a uniform solution not only for receiver-based input streams, but 
 also for direct streams like DirectKafkaInputStream and new ones.
 * Slow-start the rate to control network congestion when the job is started.
 * A pluggable framework to make extensions easier to maintain.
 
 Here is the design doc 
 (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
  and working branch 
 (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
 Any comment would be greatly appreciated.






[jira] [Resolved] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error

2015-04-07 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-6736.
---
Resolution: Fixed

Issue resolved by pull request 5388
[https://github.com/apache/spark/pull/5388]

 [GraphX]Example of Graph#aggregateMessages has error
 

 Key: SPARK-6736
 URL: https://issues.apache.org/jira/browse/SPARK-6736
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, GraphX
Affects Versions: 1.3.0
Reporter: Sasaki Toru
Priority: Minor
 Fix For: 1.4.0


 The example of Graph#aggregateMessages has an error.
 Since aggregateMessages is a method of Graph, it should be written as 
 rawGraph.aggregateMessages






[jira] [Assigned] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6739:
---

Assignee: Apache Spark

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Assignee: Apache Spark
Priority: Trivial

 Missing import in the example script under the section "Programmatically 
 Specifying the Schema":
 scala> val schema =
      |   StructType(
      |     schemaString.split(" ").map(fieldName => StructField(fieldName, 
 StringType, true)))
 <console>:25: error: not found: value StructType
          StructType(
          ^



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482944#comment-14482944
 ] 

Apache Spark commented on SPARK-6739:
-

User 'tijoparacka' has created a pull request for this issue:
https://github.com/apache/spark/pull/5389

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 Missing import  in example script under the section Programmatically 
 Specifying the Schema
 scala val schema =
  |   StructType(
  | schemaString.split( ).map(fieldName = StructField(fieldName, 
 StringType, true)))
 console:25: error: not found: value StructType
  StructType(
  ^



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6739:
---

Assignee: (was: Apache Spark)

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 Missing import  in example script under the section Programmatically 
 Specifying the Schema
 scala val schema =
  |   StructType(
  | schemaString.split( ).map(fieldName = StructField(fieldName, 
 StringType, true)))
 console:25: error: not found: value StructType
  StructType(
  ^



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482917#comment-14482917
 ] 

Yu Ishikawa commented on SPARK-6682:


Thanks for replying, [~mengxr].
How about dividing this issue into a few sub-issues, since its area of 
influence is fairly large? I think we have a few tasks in it.

- Add `@deprecated` tags in Scala/Java (a sketch of the pattern follows below)
- Modify the examples
- Modify the documentation
- (Share this idea with the community so that no more static train() 
methods are added)
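
For illustration, a self-contained sketch of the deprecation pattern itself. `Foo` is a 
stand-in estimator, not real MLlib source; only the use of `@deprecated` is the point:
{code}
// Hypothetical builder-style estimator standing in for an MLlib algorithm.
class Foo(private var lambda: Double = 1.0) {
  def setLambda(value: Double): this.type = { lambda = value; this }
  def run(data: Seq[Double]): Double = data.sum * lambda  // stand-in for model training
}

object Foo {
  // The old static entry point stays for API stability but now warns at compile time.
  @deprecated("Use new Foo().setLambda(...).run(data) instead", "1.4.0")
  def train(data: Seq[Double], lambda: Double): Double =
    new Foo().setLambda(lambda).run(data)
}

object Demo extends App {
  Foo.train(Seq(1.0, 2.0), 0.1)           // still compiles, emits a deprecation warning
  new Foo().setLambda(0.1).run(Seq(1.0))  // preferred builder-style call
}
{code}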

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Tijo Thomas (JIRA)
Tijo Thomas created SPARK-6739:
--

 Summary: Spark SQL Example gives errors due to missing import of 
Types org.apache.spark.sql.types
 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial


Missing import in the example script under the section "Programmatically 
Specifying the Schema":

scala> val schema =
     |   StructType(
     |     schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
<console>:25: error: not found: value StructType
       StructType(
       ^





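For reference, a minimal sketch of the fix: once the types are imported, the snippet 
compiles in the shell (the schemaString value below is a stand-in):
{code}
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val schemaString = "name age"   // stand-in value for this sketch
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
{code}
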
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)
Hong Shen created SPARK-6738:


 Summary: EstimateSize  is difference with spill file size
 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen


ExternalAppendOnlyMap spills 1100 MB of data to disk:
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)

But the file size is only 1.1M.
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b

The estimateSize is hugely different from the spill file size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482884#comment-14482884
 ] 

Xiangrui Meng commented on SPARK-6682:
--

+1 on deprecating the static train methods and not removing them for binary 
compatibility in Java/Scala. Python API could remain the same.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6721) IllegalStateException when connecting to MongoDB using spark-submit

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482950#comment-14482950
 ] 

Sean Owen commented on SPARK-6721:
--

From the stack trace, the error is coming from the Mongo driver. That is mild 
evidence that something is wrong with / misconfigured within the driver. It 
looks like it is asserting that a connection is open when it is not. Maybe you 
are reusing a closed connection or using it before it's configured / open?

I usually try to find the source code to see what is going on, like:
http://grepcode.com/file/repo1.maven.org/maven2/org.mongodb/mongo-java-driver/2.13.0/com/mongodb/DBTCPConnector.java/
I also see that it's deprecated in the latest code:
https://github.com/mongodb/mongo-java-driver/blob/77b7974d8be49c45dcba01c32a8458b121092f87/config/clirr-exclude.yml

Maybe review your code's usage of the Mongo driver in light of the stack trace 
and source? If you see a reasonable theory about how Spark is not quite calling 
the Hadoop output format correctly (though here, it's just the generic Hadoop 
output path, which is used heavily and therefore should be OK), post that here. 
Otherwise yeah I suspect you have an issue in your code.

 IllegalStateException when connecting to MongoDB using spark-submit
 ---

 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
  Labels: MongoDB, java.lang.IllegalStateexception, 
 saveAsNewAPIHadoopFile

 I get the following exception when using saveAsNewAPIHadoopFile:
 {code}
 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
 10.0.2.15): java.lang.IllegalStateException: open
 at org.bson.util.Assertions.isTrue(Assertions.java:36)
 at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
 at com.mongodb.DBCollection.insert(DBCollection.java:161)
 at com.mongodb.DBCollection.insert(DBCollection.java:107)
 at com.mongodb.DBCollection.save(DBCollection.java:1049)
 at com.mongodb.DBCollection.save(DBCollection.java:1014)
 at 
 com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 Before Spark 1.3.0 this would result in the application crashing, but now the 
 data just remains unprocessed.
 There is no close instruction at any part of the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6732) Scala existentials warning during compilation

2015-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6732.
--
Resolution: Duplicate

 Scala existentials warning during compilation
 -

 Key: SPARK-6732
 URL: https://issues.apache.org/jira/browse/SPARK-6732
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
 Environment: operating system: OSX Yosemite
 scala version: 2.10.4
 hardware: 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3
Reporter: Raymond Tay
Priority: Minor

 Certain parts of the Scala code were detected to use existentials, but 
 the Scala import can be included in the source file to prevent such warnings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6740) SQL operator and condition precedence is not honoured

2015-04-07 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-6740:
---

 Summary: SQL operator and condition precedence is not honoured
 Key: SPARK-6740
 URL: https://issues.apache.org/jira/browse/SPARK-6740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola


The following query from the SQL Logic Test suite fails to parse:

SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL

while the following (equivalent) does parse correctly:

SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL)

SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with 
higher precedence than NOT, so the first query is valid and well-defined.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6695.
--
Resolution: Won't Fix

I suppose my problem with that is that it would duplicate Spark's spill 
mechanism and would leave open the questions I put forth above about cleanup. Spark 
functions aren't supposed to need a huge amount of memory all at once, so 
I imagine the solution in every case is to redesign the method.

 Add an external iterator: a hadoop-like output collector
 

 Key: SPARK-6695
 URL: https://issues.apache.org/jira/browse/SPARK-6695
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: uncleGen

 In practical use, we sometimes need to create a big iterator, meaning one that is too 
 big in `memory usage` or too long in `array size`. On the one hand, it leads 
 to too much memory consumption. On the other hand, a single `Array` may not hold 
 all the elements, as Java array indices are of type 'int' (4 bytes or 32 
 bits). So, IMHO, we could provide a `collector` with a buffer (100 MB or 
 some other size) that can spill data to disk. The use case may look like:
 {code: borderStyle=solid}
rdd.mapPartitions { it =>
   ...
   val collector = new ExternalCollector()
   collector.collect(a)
   ...
   collector.iterator
   }

 {code}
 I have done some related work, and I need your opinions, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482987#comment-14482987
 ] 

Sean Owen commented on SPARK-6738:
--

Is that the only file spilled, though? I'm not an expert, but it looks like lots 
of files were spilled here.

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spill 1100M data to disk:
 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.5 MB to disk (12 times so far)
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.3 MB to disk (13 times so far)
 /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1105.8 MB to disk (14 times so far)
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.8 MB to disk (15 times so far)
 But the file size is only 1.1M.
 [tdwadmin@tdw-10-215-149-231 
 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
  ll -h 
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 The estimateSize  is hugh difference with spill file size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)
{code}

But the file size is only 1.1M.

{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
{code}

The estimateSize  is hugh difference with spill file size

  was:
ExternalAppendOnlyMap spill 1100M data to disk:
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)

But the file size is only 1.1M.
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b

The estimateSize  is hugh difference with spill file size


(Please use the code tag to format output)

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spill 1100M data to disk:
 {code}
 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.5 MB to disk (12 times so far)
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.3 MB to disk (13 times so far)
 /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1105.8 MB to disk (14 times so far)
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 15/04/07 16:39:50 INFO 

[jira] [Updated] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6733:
-
Priority: Trivial  (was: Major)

What file are you talking about?

 Suppression of usage of Scala existential code should be done
 -

 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
 Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay
Priority: Trivial

 The inclusion of this statement in the file 
 {code:scala}
 import scala.language.existentials
 {code}
 should have suppressed all warnings regarding the use of scala existential 
 code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483034#comment-14483034
 ] 

Tijo Thomas commented on SPARK-6739:


Please close this duplicate issue

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 Missing import  in example script under the section Programmatically 
 Specifying the Schema
 scala val schema =
  |   StructType(
  | schemaString.split( ).map(fieldName = StructField(fieldName, 
 StringType, true)))
 console:25: error: not found: value StructType
  StructType(
  ^



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483034#comment-14483034
 ] 

Tijo Thomas edited comment on SPARK-6739 at 4/7/15 11:26 AM:
-

Please close this duplicate issue.
However, the previous fix is not reflected in the documentation under the section 
"Programmatically Specifying the Schema": 
https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options


was (Author: tijo paracka):
Please close this duplicate issue.
However, the previous fix is not reflected in the documentation.

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 Missing import  in example script under the section Programmatically 
 Specifying the Schema
 scala val schema =
  |   StructType(
  | schemaString.split( ).map(fieldName = StructField(fieldName, 
 StringType, true)))
 console:25: error: not found: value StructType
  StructType(
  ^



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6739.
--
Resolution: Not A Problem

It's because the site hasn't been republished with a new release since the fix.

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 Missing import  in example script under the section Programmatically 
 Specifying the Schema
 scala val schema =
  |   StructType(
  | schemaString.split( ).map(fieldName = StructField(fieldName, 
 StringType, true)))
 console:25: error: not found: value StructType
  StructType(
  ^



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)
{code}

But the file size is only 1.1M.

{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
{code}

Here are the other spilled files.
{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51
{code}

The estimateSize  is hugh difference with spill file size

  was:
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)

[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483011#comment-14483011
 ] 

Hong Shen commented on SPARK-6738:
--

Yes, it spills lots of files, but each one is only 1.1 MB.

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spill 1100M data to disk:
 {code}
 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.5 MB to disk (12 times so far)
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.3 MB to disk (13 times so far)
 /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1105.8 MB to disk (14 times so far)
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.8 MB to disk (15 times so far)
 {code}
 But the file size is only 1.1M.
 {code}
 [tdwadmin@tdw-10-215-149-231 
 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
  ll -h 
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 {code}
 Here is the other spilled file.
 {code}
 [tdwadmin@tdw-10-215-149-231 
 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
  ll -h 
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
  
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
 total 2.2M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 
 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 
 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 
 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e:
 total 2.2M
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51
 {code}
 The estimateSize  is hugh difference with spill file size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)
{code}

But the file size is only 1.1M.

{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
{code}

Here are the other spilled files.
{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51
{code}

The estimateSize  is hugh difference with spill file size

  was:
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)

[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)
{code}

But the file size is only 1.1M.

{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
{code}

Here are the other spilled files.
{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51
{code}

The estimateSize  is hugh difference with spill file size

  was:
ExternalAppendOnlyMap spill 1100M data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)

[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.

2015-04-07 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483018#comment-14483018
 ] 

Twinkle Sachdeva commented on SPARK-6735:
-

Created a PR here : https://github.com/twinkle-sachdeva/spark/pull/1
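
For discussion, a minimal sketch of the windowed-failure-count idea (the class and method 
names are hypothetical; the actual change is in the PR above):
{code}
import scala.collection.mutable

// Only failures inside the sliding window count towards the limit, and a
// non-positive maxFailures disables the check entirely.
class WindowedFailureTracker(maxFailures: Int, windowMs: Long) {
  private val failureTimes = mutable.Queue[Long]()

  def recordFailure(nowMs: Long = System.currentTimeMillis()): Unit =
    failureTimes.enqueue(nowMs)

  def shouldFailApplication(nowMs: Long = System.currentTimeMillis()): Boolean = {
    if (maxFailures <= 0) return false                       // window check disabled
    while (failureTimes.nonEmpty && nowMs - failureTimes.head > windowMs) {
      failureTimes.dequeue()                                 // drop failures outside the window
    }
    failureTimes.size > maxFailures
  }
}
{code}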

 Provide options to make maximum executor failure count ( which kills the 
 application ) relative to a window duration or disable it.
 ---

 Key: SPARK-6735
 URL: https://issues.apache.org/jira/browse/SPARK-6735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, YARN
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Twinkle Sachdeva

 Currently there is a setting (spark.yarn.max.executor.failures) which sets the 
 maximum number of executor failures, after which the application fails.
 For long running applications, the user may want the application never to be killed, 
 or may want such a setting to be relative to a window duration. This 
 improvement is to provide options to make the maximum executor failure count 
 ( which kills the application ) relative to a window duration, or to disable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483025#comment-14483025
 ] 

Sean Owen commented on SPARK-6738:
--

Do you observe a problem? Is it possible that you are looking at unserialized 
objects in memory but a serialized representation on disk? What is the nature of 
the data? More info would be much more helpful.
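
To illustrate that point with a Spark-free sketch (none of this is Spark code): an in-memory 
estimate also counts JVM object headers, boxed values and hash-table structure, while what 
reaches disk is a serialized byte stream, and spill files are compressed by default 
(spark.shuffle.spill.compress), so the two numbers are not directly comparable.
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedSizeSketch extends App {
  // 100,000 (Long, Long) pairs: each in-memory entry costs object headers, boxed
  // keys/values and hash-table structure on top of the 16 bytes of payload.
  val data: Map[Long, Long] = (1L to 100000L).map(i => i -> i * 2).toMap

  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(data)   // plain Java serialization of the whole map
  out.close()

  // The serialized stream carries none of that per-object overhead.
  println(s"serialized size: ${bytes.size()} bytes")
}
{code}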

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spill 1100M data to disk:
 {code}
 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.5 MB to disk (12 times so far)
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.3 MB to disk (13 times so far)
 /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1105.8 MB to disk (14 times so far)
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
 in-memory map of 1106.8 MB to disk (15 times so far)
 {code}
 But the file size is only 1.1M.
 {code}
 [tdwadmin@tdw-10-215-149-231 
 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
  ll -h 
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 {code}
 Here are the other spilled files.
 {code}
 [tdwadmin@tdw-10-215-149-231 
 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
  ll -h 
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
  
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
 total 2.2M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 
 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 
 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 
 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e:
 total 2.2M
 -rw-r- 1 spark users 1.1M Apr  7 16:39 
 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43
 -rw-r- 1 spark users 1.1M Apr  7 16:41 
 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51
 {code}
 The estimateSize  is hugh difference with spill file size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types

2015-04-07 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483034#comment-14483034
 ] 

Tijo Thomas edited comment on SPARK-6739 at 4/7/15 11:25 AM:
-

Please close this duplicate issue
How ever the previous fix is not reflecting in the documentation.


was (Author: tijo paracka):
Please close this duplicate issue

 Spark SQL Example gives errors due to missing import of Types 
 org.apache.spark.sql.types
 

 Key: SPARK-6739
 URL: https://issues.apache.org/jira/browse/SPARK-6739
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 Missing import in the example script under the section "Programmatically 
 Specifying the Schema":
 scala> val schema =
  |   StructType(
  | schemaString.split(" ").map(fieldName => StructField(fieldName, 
 StringType, true)))
 <console>:25: error: not found: value StructType
  StructType(
  ^
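 For reference, a minimal corrected snippet is sketched below; the schemaString value is illustrative (taken from the shape of the docs example), and the fix is simply importing the types before building the schema:
{code:scala}
// The import missing from the "Programmatically Specifying the Schema" example.
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schemaString = "name age"  // illustrative field names

// Build the schema by mapping each field name to a nullable StringType column.
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
{code}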



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-07 Thread Raymond Tay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483036#comment-14483036
 ] 

Raymond Tay commented on SPARK-6733:


Apologies, it's DAGScheduler.scala.

 Suppression of usage of Scala existential code should be done
 -

 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
 Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay
Priority: Trivial

 The inclusion of this statement in the file 
 {code:scala}
 import scala.language.existentials
 {code}
 should have suppressed all warnings regarding the use of scala existential 
 code.
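 For illustration, a minimal, self-contained sketch of the language feature in question (this is not code from DAGScheduler itself):
{code:scala}
// Explicit existential types (the `forSome` form) trigger a compiler feature
// warning under -feature unless the corresponding language import is in scope.
import scala.language.existentials

// "An array of some unknown element type T" -- an existential type.
val xs: Array[T] forSome { type T } = Array(1, 2, 3)

println(xs.length)
{code}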



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6420) Driver's Block Manager does not use spark.driver.host in Yarn-Client mode

2015-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6420.
--
Resolution: Duplicate

 Driver's Block Manager does not use spark.driver.host in Yarn-Client mode
 ---

 Key: SPARK-6420
 URL: https://issues.apache.org/jira/browse/SPARK-6420
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Reporter: Liangliang Gu

 In my cluster, the YARN nodes do not know the client's host name.
 So I set spark.driver.host to the IP address of the client.
 But in yarn-client mode the driver's Block Manager does not use 
 spark.driver.host; it uses the hostname instead.
 I got the following error:
  TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2, hadoop-node1538098): 
 java.io.IOException: Failed to connect to example-hostname
 at 
 org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
 at 
 org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
 at 
 org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
 at 
 org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
 at 
 org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
 at 
 org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.nio.channels.UnresolvedAddressException
 at sun.nio.ch.Net.checkAddress(Net.java:127)
 at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
 at 
 io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
 at 
 io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:200)
 at 
 io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
 at 
 io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
 at 
 io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
 at 
 io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
 at 
 io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
 at 
 io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
 at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:199)
 at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 ... 1 more
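 For context, a sketch of the workaround being attempted on the driver side (the address below is a placeholder, not from the report):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Point driver-side endpoints at an address the YARN nodes can resolve.
// The report is that the Block Manager still advertises the hostname instead.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("driver-host-example")
  .set("spark.driver.host", "10.0.0.12")  // placeholder: client's reachable IP

val sc = new SparkContext(conf)
{code}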



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6741) Add support for SELECT ALL syntax

2015-04-07 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-6741:
---

 Summary: Add support for SELECT ALL syntax
 Key: SPARK-6741
 URL: https://issues.apache.org/jira/browse/SPARK-6741
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Priority: Minor


Support the SELECT ALL syntax (equivalent to a plain SELECT, i.e. without DISTINCT).
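For illustration, once supported the explicit form would behave exactly like the implicit default (table and column names below are hypothetical, assuming an existing sqlContext):
{code:scala}
// SELECT ALL keeps duplicate rows (the default), while SELECT DISTINCT removes them.
val all      = sqlContext.sql("SELECT ALL name FROM people")      // same result as "SELECT name FROM people"
val distinct = sqlContext.sql("SELECT DISTINCT name FROM people")
{code}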



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6742:
--

This is the same issue as SPARK-6554, which covers the new Parquet path.

 Spark pushes down filters in old parquet path that reference partitioning 
 columns
 -

 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta

 Create a table with multiple fields, partitioned on the 'market' column. Run a 
 query like: 
 SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
 csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
 (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
 (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
 start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
 '2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC 
 LIMIT 100
 The OR filter is pushed down in this case, resulting in a "column not found" 
 exception from Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Yash Datta (JIRA)
Yash Datta created SPARK-6742:
-

 Summary: Spark pushes down filters in old parquet path that 
reference partitioning columns
 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta


Create a table with multiple fields, partitioned on the 'market' column. Run a 
query like:

SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
(region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
(bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
'2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 
100

The OR filter is pushed down in this case, resulting in a "column not found" 
exception from Parquet.
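For clarity, a simplified, self-contained sketch of the intended behavior (names are illustrative, this is not the actual Spark code path): filters that reference partitioning columns should only drive partition pruning, because those columns are not stored inside the Parquet files.
{code:scala}
// Hypothetical filter representation for illustration only.
case class ColumnFilter(column: String, predicate: String)

val partitionColumns = Set("market", "dt")

val allFilters = Seq(
  ColumnFilter("market", "= 'LA metro'"),
  ColumnFilter("bandclass", "= '800'"),
  ColumnFilter("dt", ">= '2014-11-13-00-00'"))

// Only filters on non-partition columns are safe to push down to the Parquet reader;
// the rest must be handled by partition pruning.
val (partitionFilters, pushableFilters) =
  allFilters.partition(f => partitionColumns.contains(f.column))
{code}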




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6742:
---

Assignee: Apache Spark

 Spark pushes down filters in old parquet path that reference partitioning 
 columns
 -

 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta
Assignee: Apache Spark

 Create a table with multiple fields, partitioned on the 'market' column. Run a 
 query like: 
 SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
 csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
 (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
 (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
 start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
 '2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC 
 LIMIT 100
 The OR filter is pushed down in this case, resulting in a "column not found" 
 exception from Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6742:
---

Assignee: (was: Apache Spark)

 Spark pushes down filters in old parquet path that reference partitioning 
 columns
 -

 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta

 Create a table with multiple fields, partitioned on the 'market' column. Run a 
 query like: 
 SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
 csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
 (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
 (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
 start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
 '2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC 
 LIMIT 100
 The OR filter is pushed down in this case, resulting in a "column not found" 
 exception from Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483095#comment-14483095
 ] 

Apache Spark commented on SPARK-6742:
-

User 'saucam' has created a pull request for this issue:
https://github.com/apache/spark/pull/5390

 Spark pushes down filters in old parquet path that reference partitioning 
 columns
 -

 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta

 Create a table with multiple fields, partitioned on the 'market' column. Run a 
 query like: 
 SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
 csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
 (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
 (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
 start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
 '2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC 
 LIMIT 100
 The OR filter is pushed down in this case, resulting in a "column not found" 
 exception from Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spills 2.2 GB of data to disk:

{code}

15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
in-memory map of 2.2 GB to disk (61 times so far)
15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

But the file size is only 2.2 MB.

{code}
ll -h 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
total 2.2M
-rw-r- 1 spark users 2.2M Apr  7 20:27 
temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

The GC log shows that the JVM memory usage is less than 1 GB.
{code}
2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
{code}

The estimateSize value is hugely different from the spill file size.

  was:
ExternalAppendOnlyMap spills 1100 MB of data to disk:

{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.5 MB to disk (12 times so far)
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.3 MB to disk (13 times so far)
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1105.8 MB to disk (14 times so far)
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling 
in-memory map of 1106.8 MB to disk (15 times so far)
{code}

But the file size is only 1.1 MB.

{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr  7 16:39 
/data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
{code}

Here are the other spilled files.
{code}
[tdwadmin@tdw-10-215-149-231 
~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$
 ll -h 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
total 2.2M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
-rw-r- 1 spark users 1.1M Apr  7 16:39 
temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:41 
temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f

/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
total 1.1M
-rw-r- 1 spark users 1.1M Apr  7 16:40 
temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9


[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483103#comment-14483103
 ] 

Sean Owen commented on SPARK-6738:
--

To be clear, I am asking how big the data being spilled is in memory. The GC 
state isn't relevant. That is, is the data simply compressing about 10x when 
serialized into the files you see? That would not be crazy.
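For what it's worth, a rough way to compare the two numbers outside of the job; the sample data is illustrative, and it assumes access to Spark's SizeEstimator in org.apache.spark.util (its visibility may vary by version):
{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.util.SizeEstimator

val sample = (1 to 100000).map(i => (i, s"value_$i")).toArray

// Estimated in-memory footprint (object headers, pointers, etc.).
val inMemory = SizeEstimator.estimate(sample)

// Java-serialized size, which is much closer to what lands on disk (before compression).
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(sample)
oos.close()
val serialized = bos.size()

println(s"estimated in-memory: $inMemory bytes, serialized: $serialized bytes")
{code}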

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2 MB.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory usage is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
 {code}
 The estimateSize value is hugely different from the spill file size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483104#comment-14483104
 ] 

Hong Shen commented on SPARK-6738:
--

I don't think serialization is the cause of the problem. The input data is a Hive 
table, and the Spark job is a Spark SQL query.
In fact, when the log shows it is spilling an in-memory map of 2.2 GB to disk, 
the file is only 2.2 MB, and the GC log shows the JVM is using less than 1 GB. The 
estimateSize value also deviates from the actual JVM memory usage.


 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2 MB.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory usage is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
 {code}
 The estimateSize value is hugely different from the spill file size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6612) Python KMeans parity

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483109#comment-14483109
 ] 

Apache Spark commented on SPARK-6612:
-

User 'FlytxtRnD' has created a pull request for this issue:
https://github.com/apache/spark/pull/5391

 Python KMeans parity
 

 Key: SPARK-6612
 URL: https://issues.apache.org/jira/browse/SPARK-6612
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Hrishikesh
Priority: Minor

 This is a subtask of [SPARK-6258] for the Python API of KMeans.  These items 
 are missing:
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6612) Python KMeans parity

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6612:
---

Assignee: Hrishikesh  (was: Apache Spark)

 Python KMeans parity
 

 Key: SPARK-6612
 URL: https://issues.apache.org/jira/browse/SPARK-6612
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Hrishikesh
Priority: Minor

 This is a subtask of [SPARK-6258] for the Python API of KMeans.  These items 
 are missing:
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6612) Python KMeans parity

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6612:
---

Assignee: Apache Spark  (was: Hrishikesh)

 Python KMeans parity
 

 Key: SPARK-6612
 URL: https://issues.apache.org/jira/browse/SPARK-6612
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor

 This is a subtask of [SPARK-6258] for the Python API of KMeans.  These items 
 are missing:
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6743) Join with empty projection on one side produces invalid results

2015-04-07 Thread Santiago M. Mola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santiago M. Mola updated SPARK-6743:

Priority: Critical  (was: Major)

 Join with empty projection on one side produces invalid results
 ---

 Key: SPARK-6743
 URL: https://issues.apache.org/jira/browse/SPARK-6743
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Priority: Critical

 {code:scala}
 val sqlContext = new SQLContext(sc)
 val tab0 = sc.parallelize(Seq(
   (83, 0, 38),
   (26, 0, 79),
   (43, 81, 24)
 ))
 sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
 sqlContext.cacheTable("tab0")
 val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
 val result1 = df1.collect()
 val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
 val result2 = df2.collect()
 val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
 val result3 = df3.collect()
 {code}
 Given the previous code, result2 equals Row(43), Row(83), Row(26), which is 
 wrong: these values correspond to cor0._1 instead of cor0._2. The correct 
 results would be Row(0), Row(81), which is what the third query returns. The 
 first query also produces valid results; the only difference there is that the 
 projection on the left side of the join is not empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6743) Join with empty projection on one side produces invalid results

2015-04-07 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-6743:
---

 Summary: Join with empty projection on one side produces invalid 
results
 Key: SPARK-6743
 URL: https://issues.apache.org/jira/browse/SPARK-6743
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola


{code:scala}
val sqlContext = new SQLContext(sc)
val tab0 = sc.parallelize(Seq(
  (83, 0, 38),
  (26, 0, 79),
  (43, 81, 24)
))
sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
sqlContext.cacheTable("tab0")
val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
val result1 = df1.collect()
val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
val result2 = df2.collect()
val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
val result3 = df3.collect()
{code}

Given the previous code, result2 equals Row(43), Row(83), Row(26), which is 
wrong: these values correspond to cor0._1 instead of cor0._2. The correct 
results would be Row(0), Row(81), which is what the third query returns. The 
first query also produces valid results; the only difference there is that the 
projection on the left side of the join is not empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming

2015-04-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483174#comment-14483174
 ] 

Emre Sevinç edited comment on SPARK-3276 at 4/7/15 2:36 PM:


[~srowen] would it be fine if I added a public API method on FileInputDStream 
class that takes a single parameter (duration) and sets the value of 
{{MIN_REMEMBER_DURATION}} to that value? And of course, at the same time 
changing MIN_REMEMBER_DURATION from a constant into a variable, with a default 
value of 1 minute (that is the currently hard-coded value).

Or, as an alternative to achieve the similar effect: Create another Spark 
configuration property (with a default value of 1 minute) and re-factor the 
code so that (the new) {{minRememberDuration}} variable takes its value from 
that property.

Right now, I have no idea which of the above two approaches is more meaningful 
/ idiomatic. Any comments?
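A rough sketch of the configuration-based alternative, just to make the idea concrete (the property name and default below are assumptions for illustration, not an agreed-upon API):
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Seconds}

// Inside FileInputDStream, the hard-coded constant would be replaced by something like:
def minRememberDuration(conf: SparkConf): Duration =
  Seconds(conf.getLong("spark.streaming.minRememberDuration", 60L))  // default: 1 minute
{code}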


was (Author: emres):
[~srowen] would it be fine if I added a public API method on FileInputDStream 
class that takes a single parameter (duration) and sets the value of 
MIN_REMEMBER_DURATION to that value? And of course, at the same time changing 
MIN_REMEMBER_DURATION from a constant into a variable, with a default value of 
1 minute (that is the currently hard-coded value).

 Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input 
 in streaming
 --

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jack Hu
Priority: Minor

 Currently, there is only one API, textFileStream in StreamingContext, to create 
 a text file DStream, and it always ignores old files. Sometimes the old files 
 are still useful.
 We need an API that lets the user choose whether old files should be ignored or 
 not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
 fileStream[LongWritable, Text, 
 TextInputFormat](directory).map(_._2.toString)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3591) Provide fire and forget option for YARN cluster mode

2015-04-07 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3591.
--
  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Tao Wang
Target Version/s: 1.4.0  (was: 1.2.0)

 Provide fire and forget option for YARN cluster mode
 --

 Key: SPARK-3591
 URL: https://issues.apache.org/jira/browse/SPARK-3591
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Tao Wang
 Fix For: 1.4.0


 After launching an application through yarn-cluster mode, the SparkSubmit 
 process sticks around and enters a monitoring loop to track the application's 
 status. This is really a responsibility that belongs to a different process, 
 such that SparkSubmit can run yarn-cluster applications in a fire-and-forget 
 mode. We already do this for standalone-cluster mode; we should do it for 
 yarn-cluster mode too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483275#comment-14483275
 ] 

Apache Spark commented on SPARK-6602:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5392

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming

2015-04-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483174#comment-14483174
 ] 

Emre Sevinç commented on SPARK-3276:


[~srowen] would it be fine if I added a public API method on FileInputDStream 
class that takes a single parameter (duration) and sets the value of 
MIN_REMEMBER_DURATION to that value? And of course, at the same time changing 
MIN_REMEMBER_DURATION from a constant into a variable, with a default value of 
1 minute (that is the currently hard-coded value).

 Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input 
 in streaming
 --

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jack Hu
Priority: Minor

 Currently, there is only one API, textFileStream in StreamingContext, to create 
 a text file DStream, and it always ignores old files. Sometimes the old files 
 are still useful.
 We need an API that lets the user choose whether old files should be ignored or 
 not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
 fileStream[LongWritable, Text, 
 TextInputFormat](directory).map(_._2.toString)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6744) Add support for CROSS JOIN syntax

2015-04-07 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-6744:
---

 Summary: Add support for CROSS JOIN syntax
 Key: SPARK-6744
 URL: https://issues.apache.org/jira/browse/SPARK-6744
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
 Environment: Add support for the standard CROSS JOIN syntax.
Reporter: Santiago M. Mola
Priority: Minor
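
For illustration, the requested standard syntax (table names are hypothetical, assuming an existing sqlContext):
{code:scala}
// An explicit Cartesian product; equivalent to listing both tables in the
// FROM clause with no join condition.
val df = sqlContext.sql("SELECT * FROM customers CROSS JOIN orders")
{code}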






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5242) ec2/spark_ec2.py lauch does not work with VPC if no public DNS or IP is available

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483164#comment-14483164
 ] 

Apache Spark commented on SPARK-5242:
-

User 'mdagost' has created a pull request for this issue:
https://github.com/apache/spark/pull/5244

 ec2/spark_ec2.py lauch does not work with VPC if no public DNS or IP is 
 available
 ---

 Key: SPARK-5242
 URL: https://issues.apache.org/jira/browse/SPARK-5242
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor
  Labels: easyfix

 How to reproduce: a user starting a cluster in a VPC has to wait forever:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 Searching for existing cluster SparkByScript...
 Spark AMI: ami-1ae0166d
 Launching instances...
 Launched 1 slaves in eu-west-1a, regid = r-e70c5502
 Launched master in eu-west-1a, regid = r-bf0f565a
 Waiting for cluster to enter 'ssh-ready' state..{forever}
 {code}
 The problem is that the current code makes the wrong assumption that a VPC 
 instance has a public_dns_name or a public ip_address. Actually, it is more 
 common for a VPC instance to have only a private_ip_address.
 The bug is already fixed in my fork; I am going to submit a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF

2015-04-07 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Summary: Support List as a return type in Hive UDF  (was: Support List as 
a return type in Hive UDF)

 Support List as a return type in Hive UDF
 ---

 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro

 The current implementation can't handle List as a return type in a Hive UDF.
 Assume a UDF like the one below:
 public class UDFToListString extends UDF {
   public List<String> evaluate(Object o) {
     return Arrays.asList("xxx", "yyy", "zzz");
   }
 }
 A scala.MatchError exception is thrown as follows when the UDF is used:
 scala.MatchError: interface java.util.List (of class java.lang.Class)
   at 
 org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
   at 
 scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
 ...
 To fix this problem, we need to add an entry for List in 
 HiveInspectors#javaClassToDataType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6746) Refactor large functions in DAGScheduler to improve readibility

2015-04-07 Thread Ilya Ganelin (JIRA)
Ilya Ganelin created SPARK-6746:
---

 Summary: Refactor large functions in DAGScheduler to improve 
readibility
 Key: SPARK-6746
 URL: https://issues.apache.org/jira/browse/SPARK-6746
 Project: Spark
  Issue Type: Sub-task
Reporter: Ilya Ganelin


The DAGScheduler class contains two huge functions that make it 
very hard to understand what's going on in the code. These are:

1) The monolithic handleTaskCompletion 
2) The cleanupStateForJobAndIndependentStages function

These can be simply modularized to eliminate some awkward type casting and 
improve code readability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF

2015-04-07 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Description: 
The current implementation can't handle List as a return type in a Hive UDF.
Assume a UDF like the one below:

public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...

To fix this problem, we need to add an entry for List in 
HiveInspectors#javaClassToDataType.
However, there is one difficulty because of type erasure in the JVM.
Assume the lines below are appended in HiveInspectors#javaClassToDataType:

// list type
case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
  val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
  println(tpe.getActualTypeArguments()(0).toString()) // => 'E'

This logic fails to recover the component type of the List.

  was:
The current implementation can't handle List as a return type in a Hive UDF.
Assume a UDF like the one below:

public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...

To fix this problem, we need to add an entry for List in 
HiveInspectors#javaClassToDataType.



 Support List as a return type in Hive UDF
 ---

 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro

 The current implementation can't handle List as a return type in a Hive UDF.
 Assume a UDF like the one below:
 public class UDFToListString extends UDF {
   public List<String> evaluate(Object o) {
     return Arrays.asList("xxx", "yyy", "zzz");
   }
 }
 A scala.MatchError exception is thrown as follows when the UDF is used:
 scala.MatchError: interface java.util.List (of class java.lang.Class)
   at 
 org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
   at 
 

[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF

2015-04-07 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Description: 
The current implementation can't handle List as a return type in a Hive UDF.
Assume a UDF like the one below:

public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...

To fix this problem, we need to add an entry for List in 
HiveInspectors#javaClassToDataType.


  was:
The current implementation can't handle List as a return type in a Hive UDF.
Assume a UDF like the one below:

public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...


 Support List as a return type in Hive UDF
 -

 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro

 The current implementation can't handle List as a return type in a Hive UDF.
 Assume a UDF like the one below:
 public class UDFToListString extends UDF {
   public List<String> evaluate(Object o) {
     return Arrays.asList("xxx", "yyy", "zzz");
   }
 }
 A scala.MatchError exception is thrown as follows when the UDF is used:
 scala.MatchError: interface java.util.List (of class java.lang.Class)
   at 
 org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
   at 
 scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
 ...
 To fix this problem, we need to add an entry for List in 
 HiveInspectors#javaClassToDataType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6745) Develop a general filter function to be used in PrunedFilteredScan and CatalystScan

2015-04-07 Thread Alex Liu (JIRA)
Alex Liu created SPARK-6745:
---

 Summary: Develop a general filter function to be used in  
PrunedFilteredScan and CatalystScan
 Key: SPARK-6745
 URL: https://issues.apache.org/jira/browse/SPARK-6745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Alex Liu


While integrating PrunedFilteredScan/CatalystScan with Cassandra, some filters 
are pushed down to the Cassandra Spark connector and others are left out. The 
left-out filters need to be applied to the RDD[Row] obtained from the Cassandra 
Spark Connector:

RDD[Row].filter(generalFilter)

{code}
def filter(f: T => Boolean): RDD[T]
generalFilter: Row => Boolean
{code}

Here generalFilter combines the leftover filters and applies them to each Row. 
This is a general function that could be applied to many data source 
integrations.
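
A minimal sketch of such a combinator, assuming the leftover filters are already expressed as Row => Boolean predicates (the names are illustrative, not an existing Spark API):
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Combine the filters that were not pushed down into a single predicate.
def combineFilters(leftover: Seq[Row => Boolean]): Row => Boolean =
  row => leftover.forall(pred => pred(row))

// Apply the combined predicate to the rows coming back from the connector.
def applyLeftoverFilters(rows: RDD[Row], leftover: Seq[Row => Boolean]): RDD[Row] =
  rows.filter(combineFilters(leftover))
{code}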



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6747) Support List as a return type in Hive UDF

2015-04-07 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-6747:
---

 Summary: Support List as a return type in Hive UDF
 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro


The current implementation can't handle List as a return type in a Hive UDF.
Assume a UDF like the one below:

public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5114) Should Evaluator be a PipelineStage

2015-04-07 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483335#comment-14483335
 ] 

Peter Rudenko commented on SPARK-5114:
--

+1 for "should". In my use case (creating a pipeline from a config file), 
sometimes there is a need to do evaluation with custom metrics (e.g. Gini norm, 
etc.), and sometimes there is no need to do evaluation because it is done in 
another part of the system. It would be more flexible if the evaluator could be 
part of the pipeline.

 Should Evaluator be a PipelineStage
 ---

 Key: SPARK-5114
 URL: https://issues.apache.org/jira/browse/SPARK-5114
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Pipelines can currently contain Estimators and Transformers.
 Question for debate: Should Pipelines be able to contain Evaluators?
 Pros:
 * Evaluators take input datasets with particular schema, which should perhaps 
 be checked before running a Pipeline.
 Cons:
 * Evaluators do not transform datasets.   They produce a scalar (or a few 
 values), which makes it hard to say how they fit into a Pipeline or a 
 PipelineModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6746) Refactor large functions in DAGScheduler to improve readibility

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6746:
---

Assignee: Apache Spark

 Refactor large functions in DAGScheduler to improve readibility
 ---

 Key: SPARK-6746
 URL: https://issues.apache.org/jira/browse/SPARK-6746
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Ilya Ganelin
Assignee: Apache Spark

 The DAGScheduler class contains two huge functions that make it 
 very hard to understand what's going on in the code. These are:
 1) The monolithic handleTaskCompletion 
 2) The cleanupStateForJobAndIndependentStages function
 These can be simply modularized to eliminate some awkward type casting and 
 improve code readability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5818) unable to use add jar in hql

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5818:
---

Assignee: (was: Apache Spark)

 unable to use add jar in hql
 --

 Key: SPARK-5818
 URL: https://issues.apache.org/jira/browse/SPARK-5818
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: pengxu

 In Spark 1.2.0 and 1.2.1, it is not possible to use the Hive command "add jar" 
 in HQL.
 It seems that the problem in SPARK-2219 still exists.
 The problem can be reproduced as described below. Suppose the jar file 
 is named brickhouse-0.6.0.jar and is placed in the /tmp directory.
 {code}
 spark-shell> import org.apache.spark.sql.hive._
 spark-shell> val sqlContext = new HiveContext(sc)
 spark-shell> import sqlContext._
 spark-shell> hql("add jar /tmp/brickhouse-0.6.0.jar")
 {code}
 The error message is shown below.
 {code:title=Error Log}
 15/02/15 01:36:31 ERROR SessionState: Unable to register 
 /tmp/brickhouse-0.6.0.jar
 Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
 cast to java.net.URLClassLoader
 java.lang.ClassCastException: 
 org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
 java.net.URLClassLoader
   at 
 org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
   at 
 org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
   at 
 org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:37)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC.init(console:39)
   at $line30.$read$$iwC$$iwC$$iwC.init(console:41)
   at $line30.$read$$iwC$$iwC.init(console:43)
   at $line30.$read$$iwC.init(console:45)
   at $line30.$read.init(console:47)
   at $line30.$read$.init(console:51)
   at $line30.$read$.clinit(console)
   at $line30.$eval$.init(console:7)
   at $line30.$eval$.clinit(console)
   at $line30.$eval.$print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
   at 

[jira] [Commented] (SPARK-5818) unable to use add jar in hql

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483376#comment-14483376
 ] 

Apache Spark commented on SPARK-5818:
-

User 'gvramana' has created a pull request for this issue:
https://github.com/apache/spark/pull/5393

 unable to use add jar in hql
 --

 Key: SPARK-5818
 URL: https://issues.apache.org/jira/browse/SPARK-5818
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: pengxu

 In Spark 1.2.1 and 1.2.0, it's not possible to use the Hive command add jar 
 in hql.
 It seems that the problem in SPARK-2219 still exists.
 The problem can be reproduced as described below. Suppose the jar file 
 is named brickhouse-0.6.0.jar and is placed in the /tmp directory.
 {code}
 spark-shell> import org.apache.spark.sql.hive._
 spark-shell> val sqlContext = new HiveContext(sc)
 spark-shell> import sqlContext._
 spark-shell> hql("add jar /tmp/brickhouse-0.6.0.jar")
 {code}
 The error message is shown below:
 {code:title=Error Log}
 15/02/15 01:36:31 ERROR SessionState: Unable to register 
 /tmp/brickhouse-0.6.0.jar
 Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
 cast to java.net.URLClassLoader
 java.lang.ClassCastException: 
 org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
 java.net.URLClassLoader
   at 
 org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
   at 
 org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
   at 
 org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:37)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC.init(console:39)
   at $line30.$read$$iwC$$iwC$$iwC.init(console:41)
   at $line30.$read$$iwC$$iwC.init(console:43)
   at $line30.$read$$iwC.init(console:45)
   at $line30.$read.init(console:47)
   at $line30.$read$.init(console:51)
   at $line30.$read$.clinit(console)
   at $line30.$eval$.init(console:7)
   at $line30.$eval$.clinit(console)
   at $line30.$eval.$print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
   at 

[jira] [Assigned] (SPARK-5818) unable to use add jar in hql

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5818:
---

Assignee: Apache Spark

 unable to use add jar in hql
 --

 Key: SPARK-5818
 URL: https://issues.apache.org/jira/browse/SPARK-5818
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: pengxu
Assignee: Apache Spark

 In Spark 1.2.1 and 1.2.0, it's not possible to use the Hive command add jar 
 in hql.
 It seems that the problem in SPARK-2219 still exists.
 The problem can be reproduced as described below. Suppose the jar file 
 is named brickhouse-0.6.0.jar and is placed in the /tmp directory.
 {code}
 spark-shell> import org.apache.spark.sql.hive._
 spark-shell> val sqlContext = new HiveContext(sc)
 spark-shell> import sqlContext._
 spark-shell> hql("add jar /tmp/brickhouse-0.6.0.jar")
 {code}
 The error message is shown below:
 {code:title=Error Log}
 15/02/15 01:36:31 ERROR SessionState: Unable to register 
 /tmp/brickhouse-0.6.0.jar
 Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
 cast to java.net.URLClassLoader
 java.lang.ClassCastException: 
 org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
 java.net.URLClassLoader
   at 
 org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
   at 
 org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
   at 
 org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:37)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC.init(console:39)
   at $line30.$read$$iwC$$iwC$$iwC.init(console:41)
   at $line30.$read$$iwC$$iwC.init(console:43)
   at $line30.$read$$iwC.init(console:45)
   at $line30.$read.init(console:47)
   at $line30.$read$.init(console:51)
   at $line30.$read$.clinit(console)
   at $line30.$eval$.init(console:7)
   at $line30.$eval$.clinit(console)
   at $line30.$eval.$print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
   at 

[jira] [Assigned] (SPARK-6746) Refactor large functions in DAGScheduler to improve readability

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6746:
---

Assignee: (was: Apache Spark)

 Refactor large functions in DAGScheduler to improve readability
 ---

 Key: SPARK-6746
 URL: https://issues.apache.org/jira/browse/SPARK-6746
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Ilya Ganelin

 The DAGScheduler class contains two huge functions that make it 
 very hard to understand what's going on in the code. These are:
 1) The monolithic handleTaskCompletion 
 2) The cleanupStateForJobAndIndependentStages function
 These can be simply modularized to eliminate some awkward type casting and 
 improve code readability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6746) Refactor large functions in DAGScheduler to improve readability

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483488#comment-14483488
 ] 

Apache Spark commented on SPARK-6746:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5394

 Refactor large functions in DAGScheduler to improve readability
 ---

 Key: SPARK-6746
 URL: https://issues.apache.org/jira/browse/SPARK-6746
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Ilya Ganelin

 The DAGScheduler class contains two huge functions that make it 
 very hard to understand what's going on in the code. These are:
 1) The monolithic handleTaskCompletion 
 2) The cleanupStateForJobAndIndependentStages function
 These can be simply modularized to eliminate some awkward type casting and 
 improve code readability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6747) Support List as a return type in Hive UDF

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6747:
---

Assignee: Apache Spark

 Support List as a return type in Hive UDF
 ---

 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro
Assignee: Apache Spark

 The current implementation can't handle List as a return type in Hive UDF.
 We assume a UDF like the one below:
 public class UDFToListString extends UDF {
   public List<String> evaluate(Object o) {
     return Arrays.asList("xxx", "yyy", "zzz");
   }
 }
 An exception of scala.MatchError is thrown as follows when the UDF is used:
 scala.MatchError: interface java.util.List (of class java.lang.Class)
   at 
 org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
   at 
 scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
 ...
 To fix this problem, we need to add an entry for List in 
 HiveInspectors#javaClassToDataType.
 However, it has one difficulty because of type erasure in the JVM.
 We assume that the lines below are appended in HiveInspectors#javaClassToDataType:
 // list type
 case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
   val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
   println(tpe.getActualTypeArguments()(0).toString()) // prints 'E'
 This logic fails to catch the component type of the List.
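
For what it's worth, the component type is still recoverable from the declared
method signature rather than from the erased return class. Below is a standalone
sketch (not Spark code; the object name is made up, and no claim is made that this
is how HiveInspectors should be changed) contrasting the two lookups:
{code:scala}
import java.lang.reflect.ParameterizedType
import java.util.{Arrays, List => JList}

// Stand-in for the Hive UDF in the description above.
class UDFToListString {
  def evaluate(o: Object): JList[String] = Arrays.asList("xxx", "yyy", "zzz")
}

object ErasureSketch extends App {
  val evaluateMethod = classOf[UDFToListString].getMethod("evaluate", classOf[Object])

  // Walking the generic interfaces of the erased return class (java.util.List)
  // only surfaces the type variable, as described above.
  val erased = evaluateMethod.getReturnType
  val viaInterfaces = erased.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
  println(viaInterfaces.getActualTypeArguments()(0)) // E

  // The declared generic return type of the method still carries the argument.
  val declared = evaluateMethod.getGenericReturnType.asInstanceOf[ParameterizedType]
  println(declared.getActualTypeArguments()(0)) // class java.lang.String
}
{code}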



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6750) Upgrade ScalaStyle to 0.7

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6750:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Upgrade ScalaStyle to 0.7
 -

 Key: SPARK-6750
 URL: https://issues.apache.org/jira/browse/SPARK-6750
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Apache Spark

 0.7 contains a pretty useful bug fix: inline functions no longer require an 
 explicit return type definition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6750) Upgrade ScalaStyle to 0.7

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6750:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Upgrade ScalaStyle to 0.7
 -

 Key: SPARK-6750
 URL: https://issues.apache.org/jira/browse/SPARK-6750
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin

 0.7 contains a pretty useful bug fix: inline functions no longer require an 
 explicit return type definition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-04-07 Thread Sai Nishanth Parepally (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482297#comment-14482297
 ] 

Sai Nishanth Parepally edited comment on SPARK-3219 at 4/7/15 6:10 PM:
---

[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering 
going to be merged into MLlib? I would like to use Jaccard distance as a distance 
metric for k-means clustering, and I would like to know whether I should add this 
distance metric to derrickburns's repository or instead make the current MLlib 
implementation of k-means accept a function that computes the distance between 
any two points.


was (Author: nishanthps):
[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering 
going to be merged into mllib as I would like to use jaccard distance as a 
distance metric for kmeans clustering?

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns
  Labels: clustering

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.
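
To make the suggestion concrete, here is a minimal sketch of what a pluggable
point-to-point distance could look like, including the Jaccard distance asked
about in the comment above. This is illustrative only and is not the API of the
linked repository or of MLlib:
{code:scala}
// A distance the clusterer could accept as a parameter.
trait PointDistance extends Serializable {
  def apply(a: Array[Double], b: Array[Double]): Double
}

object SquaredEuclidean extends PointDistance {
  def apply(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}

// Jaccard distance over binary indicator vectors.
object Jaccard extends PointDistance {
  def apply(a: Array[Double], b: Array[Double]): Double = {
    val inter = a.zip(b).count { case (x, y) => x > 0 && y > 0 }
    val union = a.zip(b).count { case (x, y) => x > 0 || y > 0 }
    if (union == 0) 0.0 else 1.0 - inter.toDouble / union
  }
}

// Jaccard(Array(1.0, 1.0, 0.0, 0.0), Array(1.0, 0.0, 1.0, 0.0)) == 1.0 - 1.0 / 3
{code}
Note that Jaccard distance is not a Bregman divergence, so supporting it would go
beyond the Bregman family named in this ticket's title.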



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6751) Spark History Server support multiple application attempts

2015-04-07 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-6751:


 Summary: Spark History Server support multiple application attempts
 Key: SPARK-6751
 URL: https://issues.apache.org/jira/browse/SPARK-6751
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves


Spark on YARN supports running multiple application attempts (a configurable 
number) in case the first (or second...) attempt fails.  The Spark History 
Server only supports one history file per application, though.  Under the default 
configs it keeps the first attempt's history file. You can set the undocumented 
config spark.eventLog.overwrite to allow follow-on attempts to overwrite the first 
attempt's history file.

Note that in Spark 1.2, not having the overwrite config set causes any following 
attempts to actually fail to run; in Spark 1.3 they run and you just see a 
warning at the end of the attempt.

It would be really nice to have an option that keeps all of the attempts' history 
files.  This way a user can go back and look at each one individually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6733.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Suppression of usage of Scala existential code should be done
 -

 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
 Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay
Priority: Trivial
 Fix For: 1.4.0


 The inclusion of this statement in the file 
 {code:scala}
 import scala.language.existentials
 {code}
 should have suppressed all warnings regarding the use of Scala existential 
 code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6750) Upgrade ScalaStyle to 0.7

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483642#comment-14483642
 ] 

Apache Spark commented on SPARK-6750:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5399

 Upgrade ScalaStyle to 0.7
 -

 Key: SPARK-6750
 URL: https://issues.apache.org/jira/browse/SPARK-6750
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin

 0.7 contains a pretty useful bug fix: inline functions no longer require an 
 explicit return type definition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6746) Refactor large functions in DAGScheduler to improve readability

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483547#comment-14483547
 ] 

Apache Spark commented on SPARK-6746:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5396

 Refactor large functions in DAGScheduler to improve readability
 ---

 Key: SPARK-6746
 URL: https://issues.apache.org/jira/browse/SPARK-6746
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Ilya Ganelin

 The DAGScheduler class contains two huge functions that make it 
 very hard to understand what's going on in the code. These are:
 1) The monolithic handleTaskCompletion 
 2) The cleanupStateForJobAndIndependentStages function
 These can be simply modularized to eliminate some awkward type casting and 
 improve code readability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6748) QueryPlan.schema should be a lazy val to avoid creating excessive duplicate StructType objects

2015-04-07 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6748:
-

 Summary: QueryPlan.schema should be a lazy val to avoid creating 
excessive duplicate StructType objects
 Key: SPARK-6748
 URL: https://issues.apache.org/jira/browse/SPARK-6748
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Cheng Lian


Spotted this issue while trying to do a simple micro benchmark:
{code}
sc.parallelize(1 to 10000000).
  map(i => (i, s"val_$i")).
  toDF("key", "value").
  saveAsParquetFile("file:///tmp/src.parquet")

sqlContext.parquetFile("file:///tmp/src.parquet").collect()
{code}
YJP profiling result showed that, *10 million {{StructType}}, 10 million 
{{StructField \[\]}}, and 20 million {{StructField}} were allocated*.

It turned out that {{DataFrame.collect()}} calls 
{{SparkPlan.executeCollect()}}, which consists of a single line:
{code}
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
{code}
The problem is that {{QueryPlan.schema}} is a def (re-evaluated on every call) and 
that, since 1.3.0, {{convertRowToScala}} returns a {{GenericRowWithSchema}}. 
Together, these two facts result in 10 million rows, each carrying its own schema object.
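
Here is a standalone sketch of the def-versus-lazy-val difference at the heart of
the report, using simplified stand-in types rather than the actual Catalyst classes:
{code:scala}
case class SchemaStandIn(fields: Seq[String])

class PlanWithDef(names: Seq[String]) {
  def schema: SchemaStandIn = SchemaStandIn(names)      // fresh object on every call
}

class PlanWithLazyVal(names: Seq[String]) {
  lazy val schema: SchemaStandIn = SchemaStandIn(names) // computed once, then shared
}

object SchemaSharingSketch extends App {
  val d = new PlanWithDef(Seq("key", "value"))
  val l = new PlanWithLazyVal(Seq("key", "value"))
  // Converting 10 million rows through a `def` allocates 10 million schema
  // objects; with a lazy val every row references the same instance.
  assert(!(d.schema eq d.schema))
  assert(l.schema eq l.schema)
}
{code}
Making the field a lazy val does assume the plan's output no longer changes once
the schema has been materialized.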



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory

2015-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483609#comment-14483609
 ] 

Apache Spark commented on SPARK-6737:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5397

 OutputCommitCoordinator.authorizedCommittersByStage map out of memory
 -

 Key: SPARK-6737
 URL: https://issues.apache.org/jira/browse/SPARK-6737
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core, Streaming
Affects Versions: 1.3.0, 1.3.1
 Environment: spark 1.3.1
Reporter: Tao Li
Assignee: Josh Rosen
Priority: Critical
  Labels: Bug, Core, DAGScheduler, OOM, Streaming

 I am using Spark Streaming (1.3.1) as a long-running service, and it runs out of 
 memory after running for 7 days. 
 I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
 class causes the OOM. 
 authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
 Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
 stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd 
 is never called by the DAGScheduler. As a result, the per-stage entries in 
 authorizedCommittersByStage are never cleaned up, which causes the OOM.
 It happens in my Spark Streaming program (1.3.1); I am not sure whether it also 
 appears in other Spark components or other Spark versions.
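
A minimal sketch of the leak pattern described above, with simplified types (this
is not the actual OutputCommitCoordinator code):
{code:scala}
import scala.collection.mutable

class CommitCoordinatorSketch {
  // stageId -> (partitionId -> authorized task attempt)
  private val authorizedCommittersByStage =
    mutable.Map.empty[Int, mutable.Map[Int, Long]]

  def stageStart(stageId: Int): Unit =
    authorizedCommittersByStage(stageId) = mutable.Map.empty[Int, Long]

  // Entries are only released here. If the scheduler never calls stageEnd, every
  // stage ever started stays in the map for the lifetime of the application,
  // which is exactly the unbounded growth a long-running streaming job hits.
  def stageEnd(stageId: Int): Unit =
    authorizedCommittersByStage.remove(stageId)
}
{code}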



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6737:
---

Assignee: Apache Spark  (was: Josh Rosen)

 OutputCommitCoordinator.authorizedCommittersByStage map out of memory
 -

 Key: SPARK-6737
 URL: https://issues.apache.org/jira/browse/SPARK-6737
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core, Streaming
Affects Versions: 1.3.0, 1.3.1
 Environment: spark 1.3.1
Reporter: Tao Li
Assignee: Apache Spark
Priority: Critical
  Labels: Bug, Core, DAGScheduler, OOM, Streaming

 I am using Spark Streaming (1.3.1) as a long-running service, and it runs out of 
 memory after running for 7 days. 
 I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
 class causes the OOM. 
 authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
 Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
 stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd 
 is never called by the DAGScheduler. As a result, the per-stage entries in 
 authorizedCommittersByStage are never cleaned up, which causes the OOM.
 It happens in my Spark Streaming program (1.3.1); I am not sure whether it also 
 appears in other Spark components or other Spark versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory

2015-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6737:
---

Assignee: Josh Rosen  (was: Apache Spark)

 OutputCommitCoordinator.authorizedCommittersByStage map out of memory
 -

 Key: SPARK-6737
 URL: https://issues.apache.org/jira/browse/SPARK-6737
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core, Streaming
Affects Versions: 1.3.0, 1.3.1
 Environment: spark 1.3.1
Reporter: Tao Li
Assignee: Josh Rosen
Priority: Critical
  Labels: Bug, Core, DAGScheduler, OOM, Streaming

 I am using Spark Streaming (1.3.1) as a long-running service, and it runs out of 
 memory after running for 7 days. 
 I found that the field authorizedCommittersByStage in the OutputCommitCoordinator 
 class causes the OOM. 
 authorizedCommittersByStage is a map whose key is a StageId and whose value is a 
 Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method 
 stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd 
 is never called by the DAGScheduler. As a result, the per-stage entries in 
 authorizedCommittersByStage are never cleaned up, which causes the OOM.
 It happens in my Spark Streaming program (1.3.1); I am not sure whether it also 
 appears in other Spark components or other Spark versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


