[jira] [Commented] (SPARK-2636) no where to get job identifier while submit spark job through spark API

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113400#comment-14113400
 ] 

Apache Spark commented on SPARK-2636:
-

User 'lirui-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2176

 no where to get job identifier while submit spark job through spark API
 ---

 Key: SPARK-2636
 URL: https://issues.apache.org/jira/browse/SPARK-2636
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
  Labels: hive

 In Hive on Spark, we want to track Spark job status through the Spark API. The 
 basic idea is as follows:
 # create a Hive-specific Spark listener and register it on the Spark listener 
 bus.
 # the Hive-specific Spark listener derives job status from Spark listener events.
 # the Hive driver tracks job status through the Hive-specific Spark listener. 
 The current problem is that the Hive driver needs a job identifier to track a 
 specific job's status through the Spark listener, but there is no Spark API to 
 get a job identifier (like a job id) when submitting a Spark job.
 I think any other project that tries to track job status through the Spark API 
 would suffer from this as well.
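For orientation, a minimal sketch of the listener side of this idea, assuming the
standard 1.x SparkListener callbacks; the class name and registration call are
illustrative, not the Hive code. The missing piece the issue asks for is an API
that returns this job id to the caller at submission time.

{code}
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Collects job status keyed by Spark's internal job id as listener events arrive.
class JobStatusListener extends SparkListener {
  val status = mutable.Map[Int, String]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    status(jobStart.jobId) = "RUNNING"
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    status(jobEnd.jobId) = jobEnd.jobResult.toString
  }
}

// Registration (1.x): sc.addSparkListener(new JobStatusListener)
{code}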



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread hzw (JIRA)
hzw created SPARK-3277:
--

 Summary: LZ4 compression cause the the ExternalSort exception
 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: hzw
 Fix For: 1.1.0


I tested LZ4 compression (with a wordcount job), and it ran into this problem.
I also tested Snappy and LZF, and they were OK.
In the end I set spark.shuffle.spill to false to avoid the exception; once that 
switch is turned back on, the error comes back.
Exception info as follows:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
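For context, a rough sketch of the configuration and job shape the report
describes; the codec class name and the input path are assumptions, not taken
from the report. Paste into spark-shell or a script.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // reduceByKey on pair RDDs in 1.x

// LZ4 block compression with shuffle spilling left enabled (the failing combination).
val conf = new SparkConf()
  .setAppName("lz4-spill-repro")
  .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
  .set("spark.shuffle.spill", "true") // setting this to false masks the assertion failure
val sc = new SparkContext(conf)

// A wordcount large enough to make ExternalAppendOnlyMap spill to disk.
sc.textFile("hdfs:///path/to/large/text") // hypothetical input path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .count()
{code}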




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3230) UDFs that return structs result in ClassCastException

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3230.
-

Resolution: Fixed

 UDFs that return structs result in ClassCastException
 -

 Key: SPARK-3230
 URL: https://issues.apache.org/jira/browse/SPARK-3230
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3026) Provide a good error message if JDBC server is used but Spark is not compiled with -Pthriftserver

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3026.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Provide a good error message if JDBC server is used but Spark is not compiled 
 with -Pthriftserver
 -

 Key: SPARK-3026
 URL: https://issues.apache.org/jira/browse/SPARK-3026
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Patrick Wendell
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.1.0


 Instead of giving a ClassNotFoundException we should detect this case and 
 just tell the user to build with -Phiveserver.
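A hypothetical sketch of the kind of check being asked for; the probed class name
and the message wording are assumptions, not the actual patch.

{code}
// Fail fast with a helpful message instead of surfacing a raw ClassNotFoundException.
try {
  Class.forName("org.apache.hive.service.server.HiveServer2")
} catch {
  case _: ClassNotFoundException =>
    System.err.println(
      "The JDBC/Thrift server requires Spark to be built with the Thrift server " +
        "profile enabled. Rebuild Spark with that profile and try again.")
    sys.exit(1)
}
{code}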



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3269) SparkSQLOperationManager.getNextRowSet OOMs when a large maxRows is set

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3269:


Assignee: Cheng Lian

 SparkSQLOperationManager.getNextRowSet OOMs when a large maxRows is set
 ---

 Key: SPARK-3269
 URL: https://issues.apache.org/jira/browse/SPARK-3269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Assignee: Cheng Lian

 {{SparkSQLOperationManager.getNextRowSet}} allocates an {{ArrayBuffer[Row]}} 
 as large as {{maxRows}}, which can lead to OOM if {{maxRows}} is large, even 
 if the actual size of the row set is much smaller.
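An illustrative sketch of the allocation problem, not the SparkSQLOperationManager
code itself: the point is to size the buffer by the rows actually fetched rather
than by the client-supplied maxRows.

{code}
import scala.collection.mutable.ArrayBuffer

// Pre-sizing by maxRows allocates memory proportional to the request, not the data.
def nextRowSetEager[T](iter: Iterator[T], maxRows: Int): ArrayBuffer[T] = {
  val rows = new ArrayBuffer[T](maxRows) // OOM risk when maxRows is huge
  while (iter.hasNext && rows.size < maxRows) rows += iter.next()
  rows
}

// Growing on demand keeps memory proportional to the rows actually returned.
def nextRowSetLazy[T](iter: Iterator[T], maxRows: Int): ArrayBuffer[T] = {
  val rows = new ArrayBuffer[T]()
  while (iter.hasNext && rows.size < maxRows) rows += iter.next()
  rows
}
{code}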



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3044) Create RSS feed for Spark News

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3044:
---

Component/s: Project Infra

 Create RSS feed for Spark News
 --

 Key: SPARK-3044
 URL: https://issues.apache.org/jira/browse/SPARK-3044
 Project: Spark
  Issue Type: Documentation
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 Project updates are often posted here: http://spark.apache.org/news/
 Currently, there is no way to subscribe to a feed of these updates. It would 
 be nice if there were a way for people to be notified of new posts there without 
 having to check manually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3278) Isotonic regression

2014-08-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3278:


 Summary: Isotonic regression
 Key: SPARK-3278
 URL: https://issues.apache.org/jira/browse/SPARK-3278
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng


Add isotonic regression for score calibration.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3279) Remove useless field variable in ApplicationMaster

2014-08-28 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3279:
-

 Summary: Remove useless field variable in ApplicationMaster
 Key: SPARK-3279
 URL: https://issues.apache.org/jira/browse/SPARK-3279
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta


ApplicationMaster no longer uses ALLOCATE_HEARTBEAT_INTERVAL.
Let's remove it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3267:


Priority: Critical  (was: Major)
Target Version/s: 1.2.0
Assignee: Michael Armbrust

 Deadlock between ScalaReflectionLock and Data type initialization
 -

 Key: SPARK-3267
 URL: https://issues.apache.org/jira/browse/SPARK-3267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Assignee: Michael Armbrust
Priority: Critical

 Deadlock here:
 {code}
 Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 
 nid=0x27a in Object.wait() [0x7fab60c2e000
 ]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:202)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal
 a:175)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:304)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:314)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:313)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 ...
 {code}
 and
 {code}
 Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 
 nid=0x27e in Object.wait() [0x7fab0eeec000
 ]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
 - locked 0x00064e5d9a48 (a 
 org.apache.spark.sql.catalyst.expressions.Cast)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
 scala:139)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
 scala:139)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
 at 
 org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

[jira] [Commented] (SPARK-3279) Remove useless field variable in ApplicationMaster

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113505#comment-14113505
 ] 

Apache Spark commented on SPARK-3279:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2177

 Remove useless field variable in ApplicationMaster
 --

 Key: SPARK-3279
 URL: https://issues.apache.org/jira/browse/SPARK-3279
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 ApplicationMaster no longer uses ALLOCATE_HEARTBEAT_INTERVAL.
 Let's remove it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3280:
--

 Summary: Made sort-based shuffle the default implementation
 Key: SPARK-3280
 URL: https://issues.apache.org/jira/browse/SPARK-3280
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin


sort-based shuffle has lower memory usage and seems to outperform hash-based in 
almost all of our testing.
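Until the default changes, the sort-based implementation can be opted into
explicitly; a minimal sketch using the 1.1-era configuration key.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("sort-shuffle-example")
  .set("spark.shuffle.manager", "sort") // "hash" is the current default
{code}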



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113519#comment-14113519
 ] 

Apache Spark commented on SPARK-3280:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2178

 Made sort-based shuffle the default implementation
 --

 Key: SPARK-3280
 URL: https://issues.apache.org/jira/browse/SPARK-3280
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 sort-based shuffle has lower memory usage and seems to outperform hash-based 
 in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1912) Compression memory issue during reduce

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113524#comment-14113524
 ] 

Apache Spark commented on SPARK-1912:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2179

 Compression memory issue during reduce
 --

 Key: SPARK-1912
 URL: https://issues.apache.org/jira/browse/SPARK-1912
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Wenchen Fan
Assignee: Wenchen Fan
 Fix For: 0.9.2, 1.0.1, 1.1.0


 When we need to read a compressed block, we first create a compression 
 stream instance (LZF or Snappy) and use it to wrap that block.
 Say a reducer task needs to read 1000 local shuffle blocks: it will 
 first prepare to read all 1000 blocks, which means creating 1000 compression 
 stream instances to wrap them. But initializing a compression instance 
 allocates some memory, and having many compression instances alive at the 
 same time is a problem.
 In practice the reducer reads the shuffle blocks one by one, so why create all 
 the compression instances up front? Can we do it lazily, creating the 
 compression instance for a block only when that block is first read?
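An illustrative sketch of the lazy approach; the class and parameter names are
assumptions, not Spark's shuffle code.

{code}
import java.io.InputStream

// Defer creating the decompression stream until the block is actually consumed,
// so codec buffers exist only for the block currently being read.
class LazyBlock(open: () => InputStream, wrap: InputStream => InputStream) {
  lazy val stream: InputStream = wrap(open()) // codec buffers allocated on first access
}

def prepareBlocks(raw: Seq[() => InputStream],
                  wrap: InputStream => InputStream): Seq[LazyBlock] =
  raw.map(r => new LazyBlock(r, wrap)) // cheap: nothing is wrapped or decompressed yet
{code}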



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3281) Remove Netty specific code in BlockManager

2014-08-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3281:
--

 Summary: Remove Netty specific code in BlockManager
 Key: SPARK-3281
 URL: https://issues.apache.org/jira/browse/SPARK-3281
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


Everything should go through the BlockTransferService interface rather than 
having conditional branches for Netty.
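A schematic sketch of the shape being asked for; the trait and method names are
assumptions, not the actual BlockTransferService API.

{code}
trait BlockTransfer {
  def fetchBlock(host: String, port: Int, blockId: String): Array[Byte]
}

class NioBlockTransfer extends BlockTransfer {
  def fetchBlock(host: String, port: Int, blockId: String): Array[Byte] = Array.empty[Byte] // placeholder
}

class NettyBlockTransfer extends BlockTransfer {
  def fetchBlock(host: String, port: Int, blockId: String): Array[Byte] = Array.empty[Byte] // placeholder
}

// The transport is chosen once at wiring time; no `if (useNetty)` branches at call sites.
class BlockManagerSketch(transfer: BlockTransfer) {
  def getRemote(host: String, port: Int, blockId: String): Array[Byte] =
    transfer.fetchBlock(host, port, blockId)
}
{code}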



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3281) Remove Netty specific code in BlockManager

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113550#comment-14113550
 ] 

Apache Spark commented on SPARK-3281:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2181

 Remove Netty specific code in BlockManager
 --

 Key: SPARK-3281
 URL: https://issues.apache.org/jira/browse/SPARK-3281
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Everything should go through the BlockTransferService interface rather than 
 having conditional branches for Netty.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3282) It should support multiple receivers at one socketInputDStream

2014-08-28 Thread shenhong (JIRA)
shenhong created SPARK-3282:
---

 Summary: It should support multiple receivers at one 
socketInputDStream 
 Key: SPARK-3282
 URL: https://issues.apache.org/jira/browse/SPARK-3282
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: shenhong


At present a socketInputDStream supports at most one receiver, which becomes a 
bottleneck when a large input stream arrives. 
It should support multiple receivers for one socketInputDStream.
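A common workaround sketch rather than a fix to SocketInputDStream itself: start
several socket receivers and union their streams. The receiver count, port layout
and master setting are assumptions.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("multi-receiver").setMaster("local[8]") // cores for receivers + processing
val ssc = new StreamingContext(conf, Seconds(2))

val numReceivers = 4 // one receiver (and one port) per stream
val streams = (0 until numReceivers).map(i => ssc.socketTextStream("localhost", 9000 + i))
val unioned = ssc.union(streams)

unioned.count().print()
ssc.start()
ssc.awaitTermination()
{code}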



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3283) Receivers sometimes do not get spread out to multiple nodes

2014-08-28 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-3283:


 Summary: Receivers sometimes do not get spread out to multiple 
nodes
 Key: SPARK-3283
 URL: https://issues.apache.org/jira/browse/SPARK-3283
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


The probable reason this happens is that the JobGenerator and JobScheduler 
start generating jobs with tasks. When the ReceiverTracker submits the tasks 
containing receivers, those tasks get assigned to whichever empty slots are 
instantaneously available, which may all be on one node instead of spread 
across all the nodes. 

The original behavior was that jobs started only after the receivers had 
started, thus ensuring that all the slots were free and the receivers were 
spread evenly across all the nodes. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-08-28 Thread Chengxiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113611#comment-14113611
 ] 

Chengxiang Li commented on SPARK-2633:
--

It's quite subjective, I think: Hive on MR displays job progress as the 
percentage of finished tasks, while Hive on Tez displays job progress with exact 
running/failed/finished task counts. I think it's better to collect more detailed 
job status info, as long as it does not introduce much extra effort.
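A rough sketch of the kind of detail being discussed, not the proposed API:
counting tasks from listener events so a caller can render either a percentage
or exact counts. Field names assume the 1.x listener event classes.

{code}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

class TaskProgressListener extends SparkListener {
  val totalTasks = new AtomicInteger(0)
  val finishedTasks = new AtomicInteger(0)
  val failedTasks = new AtomicInteger(0)

  override def onStageSubmitted(stage: SparkListenerStageSubmitted): Unit =
    totalTasks.addAndGet(stage.stageInfo.numTasks)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskEnd.reason match {
    case Success => finishedTasks.incrementAndGet() // Hive-on-MR style: finished / total
    case _       => failedTasks.incrementAndGet()   // Hive-on-Tez style: exact failed count
  }
}
{code}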

 enhance spark listener API to gather more spark job information
 ---

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Priority: Critical
  Labels: hive
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Based on Hive on Spark job status monitoring and statistic collection 
 requirement, try to enhance spark listener API to gather more spark job 
 information.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3284) saveAsParquetFile not working on windows

2014-08-28 Thread Pravesh Jain (JIRA)
Pravesh Jain created SPARK-3284:
---

 Summary: saveAsParquetFile not working on windows
 Key: SPARK-3284
 URL: https://issues.apache.org/jira/browse/SPARK-3284
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Windows
Reporter: Pravesh Jain
Priority: Minor


import org.apache.spark.{SparkConf, SparkContext}

object parquet {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD

    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")

    val parquetFile =
      sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}

gives the error

Exception in thread "main" java.lang.NullPointerException at 
org.apache.spark.parquet$.main(parquet.scala:16)

which is the saveAsParquetFile line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3284) saveAsParquetFile not working on windows

2014-08-28 Thread Pravesh Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravesh Jain updated SPARK-3284:


Description: 
import org.apache.spark.{SparkConf, SparkContext}

object parquet {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD

    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")

    val parquetFile =
      sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}

gives the error

Exception in thread "main" java.lang.NullPointerException at 
org.apache.spark.parquet$.main(parquet.scala:16)

which is the saveAsParquetFile line.

This works fine on Linux, but running it from Eclipse on Windows gives the error.

  was:
object parquet {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {

val sparkConf = new 
SparkConf().setMaster(local).setAppName(HdfsWordCount)
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

val people = 
sc.textFile(C:/Users/pravesh.jain/Desktop/people/people.txt).map(_.split(,)).map(p
 = Person(p(0), p(1).trim.toInt))

people.saveAsParquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet)

val parquetFile = 
sqlContext.parquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet)
  }
}

gives the error



Exception in thread main java.lang.NullPointerException at 
org.apache.spark.parquet$.main(parquet.scala:16)

which is the line saveAsParquetFile.


 saveAsParquetFile not working on windows
 

 Key: SPARK-3284
 URL: https://issues.apache.org/jira/browse/SPARK-3284
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Windows
Reporter: Pravesh Jain
Priority: Minor

 object parquet {
   case class Person(name: String, age: Int)
   def main(args: Array[String]) {
 val sparkConf = new 
 SparkConf().setMaster(local).setAppName(HdfsWordCount)
 val sc = new SparkContext(sparkConf)
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
 import sqlContext.createSchemaRDD
 val people = 
 sc.textFile(C:/Users/pravesh.jain/Desktop/people/people.txt).map(_.split(,)).map(p
  = Person(p(0), p(1).trim.toInt))
 
 people.saveAsParquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet)
 val parquetFile = 
 sqlContext.parquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet)
   }
 }
 gives the error
 Exception in thread main java.lang.NullPointerException at 
 org.apache.spark.parquet$.main(parquet.scala:16)
 which is the line saveAsParquetFile.
 This works fine in linux but using in eclipse in windows gives the error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-1647) Prevent data loss when Streaming driver goes down

2014-08-28 Thread Giulio De Vecchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giulio De Vecchi updated SPARK-1647:


Comment: was deleted

(was: Not sure if this makes sense, but maybe it would be nice to have a kind of 
flag available within the code that tells me whether I'm running in a normal 
situation or during a recovery.
To better explain this, let's consider the following scenario:
I am processing data, say from a Kafka stream, and I am updating a 
database based on the computations. During recovery I don't want to update 
the database again (for many reasons; let's just assume that), but I want my 
system to be in the same state as before, so I would like to know whether my 
code is running for the first time or during a recovery, so I can avoid 
updating the database again.

More generally, I want to know this whenever I'm interacting with external 
entities.

)

 Prevent data loss when Streaming driver goes down
 -

 Key: SPARK-1647
 URL: https://issues.apache.org/jira/browse/SPARK-1647
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan

 Currently, when the driver goes down, any uncheckpointed data is lost from 
 within Spark. If the system from which messages are pulled can replay 
 messages, the data may be available - but for some systems, like Flume, this 
 is not the case. 
 Also, all windowing information is lost for windowing functions. 
 We must persist raw data somehow, and be able to replay this data if 
 required. We also must persist windowing information with the data itself.
 This will likely require quite a bit of work to complete and probably will 
 have to be split into several sub-jiras.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3276) Provide a API to specify whether the old files need to be ignored in file input text DStream

2014-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113704#comment-14113704
 ] 

Sean Owen commented on SPARK-3276:
--

Given the nature of a stream processing framework, when would you want to keep 
reprocessing all of the old data? That is something you can do, but it doesn't 
require Spark Streaming.

 Provide a API to specify whether the old files need to be ignored in file 
 input text DStream
 

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Jack Hu

 Currently there is only one API, textFileStream in StreamingContext, for 
 creating a text file DStream, and it always ignores old files. Sometimes the 
 old files are still useful.
 We need an API to let the user choose whether old files should be ignored or 
 not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
   fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
 }
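A sketch of what the requested convenience API could delegate to: the generic
fileStream in the 1.x StreamingContext already takes a newFilesOnly flag, it just
isn't exposed through textFileStream. The wrapper name is an assumption.

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext

def textFileStreamWithOldFiles(ssc: StreamingContext, directory: String,
                               newFilesOnly: Boolean) =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
    directory, (_: Path) => true, newFilesOnly).map(_._2.toString)
{code}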



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream

2014-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113702#comment-14113702
 ] 

Sean Owen commented on SPARK-3274:
--

Same as the problem and solution in 
https://issues.apache.org/jira/browse/SPARK-1040

 Spark Streaming Java API reports java.lang.ClassCastException when calling 
 collectAsMap on JavaPairDStream
 --

 Key: SPARK-3274
 URL: https://issues.apache.org/jira/browse/SPARK-3274
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2
Reporter: Jack Hu

 Reproduce code:
 scontext
   .socketTextStream("localhost", 1)
   .mapToPair(new PairFunction<String, String, String>() {
     public Tuple2<String, String> call(String arg0) throws Exception {
       return new Tuple2<String, String>("1", arg0);
     }
   })
   .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
     public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
       System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
       return null;
     }
   });
 Exception:
 java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
 [Lscala.Tupl
 e2;
 at 
 org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.s
 cala:447)
 at 
 org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:
 464)
 at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
 at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
 at 
 org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachR
 DD$2.apply(JavaDStreamLike.scala:282)
 at 
 org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachR
 DD$2.apply(JavaDStreamLike.scala:282)
 at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mc
 V$sp(ForEachDStream.scala:41)
 at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(Fo
 rEachDStream.scala:40)
 at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(Fo
 rEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at 
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)

2014-08-28 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-3285:


 Summary: Using values.sum is easier to understand than using 
values.foldLeft(0)(_ + _)
 Key: SPARK-3285
 URL: https://issues.apache.org/jira/browse/SPARK-3285
 Project: Spark
  Issue Type: Test
  Components: Examples
Affects Versions: 1.0.2
Reporter: Yadong Qi


def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), 
so we'd better use values.sum instead of values.foldLeft(0)(_ + _).
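A minimal illustration of the equivalence, as a sketch:

{code}
val values = Seq(1, 2, 3, 4)
val a = values.foldLeft(0)(_ + _) // 10
val b = values.sum                // 10, and reads more directly
assert(a == b)
{code}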



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113710#comment-14113710
 ] 

Sean Owen commented on SPARK-3266:
--

The method is declared in the superclass, JavaRDDLike: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala#L538

You are running a different version of Spark than you are compiling with, and 
the runtime version is perhaps too old to contain this method. This is not a 
Spark issue.

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1
Reporter: Amey Chaugule

 My code compiles, but when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), 
 although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113713#comment-14113713
 ] 

Apache Spark commented on SPARK-3285:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2182

 Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
 -

 Key: SPARK-3285
 URL: https://issues.apache.org/jira/browse/SPARK-3285
 Project: Spark
  Issue Type: Test
  Components: Examples
Affects Versions: 1.0.2
Reporter: Yadong Qi

 def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
 Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), 
 so we'd better use values.sum instead of values.foldLeft(0)(_ + _).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113741#comment-14113741
 ] 

Mridul Muralidharan commented on SPARK-3277:


This looks like an unrelated change pushed to BlockObjectWriter as part of the 
introduction of ShuffleWriteMetrics.
I had introduced checks and also documented that we must not infer size from 
the position of the stream after flush, since close can write data to the 
streams (and one flush can result in more data getting generated which need not 
be flushed to the streams).

Apparently this logic was modified subsequently, causing this bug.
The solution would be to revert the change that updates shuffleBytesWritten 
before the stream is closed.
It must be done after close, and based on file.length.
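An illustrative sketch of the point about measurement, not the BlockObjectWriter
code itself; the method and parameter names are assumptions.

{code}
import java.io.{File, FileOutputStream, OutputStream}

// A wrapping compression stream may emit more bytes on close(), so the byte count
// must come from the file length after close, not from a position taken after flush().
def writeAndMeasure(file: File, wrap: OutputStream => OutputStream,
                    write: OutputStream => Unit): Long = {
  val out = wrap(new FileOutputStream(file))
  write(out)
  out.flush()
  // file.length() here can under-count: codec blocks/footers may not be written yet
  out.close()
  file.length() // measured only after close
}
{code}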

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: hzw
 Fix For: 1.1.0


 I tested the LZ4 compression,and it come up with such problem.(with wordcount)
 Also I tested the snappy and LZF,and they were OK.
 At last I set the  spark.shuffle.spill as false to avoid such exeception, 
 but once open this switch, this error would come.
 Exeception Info as follow:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113769#comment-14113769
 ] 

Matthew Farrellee commented on SPARK-2855:
--

[~zhunansjtu] the link you supplied no longer works, please include the test 
failure in a comment on this jira

 pyspark test cases crashed for no reason
 

 Key: SPARK-2855
 URL: https://issues.apache.org/jira/browse/SPARK-2855
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Nan Zhu

 I have met this several times: 
 all Scala/Java test cases passed, but the PySpark test cases just crashed.
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113778#comment-14113778
 ] 

Nan Zhu commented on SPARK-2855:


[~joshrosen]?

 pyspark test cases crashed for no reason
 

 Key: SPARK-2855
 URL: https://issues.apache.org/jira/browse/SPARK-2855
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Nan Zhu

 I met this for several times, 
 all scala/java test cases passed, but pyspark test cases just crashed
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113776#comment-14113776
 ] 

Nan Zhu commented on SPARK-2855:


I guess they have fixed this. A Jenkins-side mistake?

 pyspark test cases crashed for no reason
 

 Key: SPARK-2855
 URL: https://issues.apache.org/jira/browse/SPARK-2855
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Nan Zhu

 I met this for several times, 
 all scala/java test cases passed, but pyspark test cases just crashed
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread hzw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hzw updated SPARK-3277:
---

Description: 
I tested LZ4 compression (with a wordcount job), and it ran into this problem.
I also tested Snappy and LZF, and they were OK.
In the end I set spark.shuffle.spill to false to avoid the exception; once that 
switch is turned back on, the error comes back.
It seems that if the number of words is small, the wordcount goes through, but 
with a more complex text this problem shows up.
Exception info as follows:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
I tested the LZ4 compression,and it come up with such problem.(with wordcount)
Also I tested the snappy and LZF,and they were OK.
At last I set the  spark.shuffle.spill as false to avoid such exeception, but 
once open this switch, this error would come.
Exeception Info as follow:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)



 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: hzw
 Fix For: 1.1.0


 I tested the LZ4 compression,and it come up with such problem.(with wordcount)
 Also I tested the snappy and LZF,and they were OK.
 At last I set the  spark.shuffle.spill as false to avoid such exeception, 
 but once open this switch, this error would come.
 It seems that if num of the words is few, wordcount will go through,but if it 
 is a complex text ,this problem will show
 Exeception Info as follow:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 

[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread hzw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113789#comment-14113789
 ] 

hzw commented on SPARK-3277:


Sorry, I can't follow this clearly since I'm not familiar with the code of 
this class.
Can you point to the line of code where it goes wrong, or make a PR to 
fix this problem?

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: hzw
 Fix For: 1.1.0


 I tested the LZ4 compression,and it come up with such problem.(with wordcount)
 Also I tested the snappy and LZF,and they were OK.
 At last I set the  spark.shuffle.spill as false to avoid such exeception, 
 but once open this switch, this error would come.
 It seems that if num of the words is few, wordcount will go through,but if it 
 is a complex text ,this problem will show
 Exeception Info as follow:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2435) Add shutdown hook to bin/pyspark

2014-08-28 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113816#comment-14113816
 ] 

Matthew Farrellee commented on SPARK-2435:
--

i couldn't find a PR for this, and it has been a problem for me, so i've created

https://github.com/apache/spark/pull/2183

 Add shutdown hook to bin/pyspark
 

 Key: SPARK-2435
 URL: https://issues.apache.org/jira/browse/SPARK-2435
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
Reporter: Andrew Or
Assignee: Josh Rosen
 Fix For: 1.1.0


 We currently never stop the SparkContext cleanly in bin/pyspark unless the 
 user explicitly runs sc.stop(). This behavior is not consistent with 
 bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting 
 the shell.
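A conceptual sketch in Scala of the behavior being asked for on the PySpark side:
stop the SparkContext from a shutdown hook so exiting the shell cleans up, the
way Ctrl+D does in bin/spark-shell. The names and master setting are illustrative.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shell").setMaster("local"))

sys.addShutdownHook {
  sc.stop() // runs on normal interpreter exit, so the context is always stopped cleanly
}
{code}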



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2435) Add shutdown hook to bin/pyspark

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113911#comment-14113911
 ] 

Apache Spark commented on SPARK-2435:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2183

 Add shutdown hook to bin/pyspark
 

 Key: SPARK-2435
 URL: https://issues.apache.org/jira/browse/SPARK-2435
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
Reporter: Andrew Or
Assignee: Josh Rosen
 Fix For: 1.1.0


 We currently never stop the SparkContext cleanly in bin/pyspark unless the 
 user explicitly runs sc.stop(). This behavior is not consistent with 
 bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting 
 the shell.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113957#comment-14113957
 ] 

Josh Rosen commented on SPARK-2855:
---

Do you recall the actual exception?  Was it a Py4J error (something like 
"connection to GatewayServer failed")?  It seems like we've been experiencing 
some flakiness in these tests and I wonder whether it's due to some system 
resource being exhausted, such as ephemeral ports.

 pyspark test cases crashed for no reason
 

 Key: SPARK-2855
 URL: https://issues.apache.org/jira/browse/SPARK-2855
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Nan Zhu

 I met this for several times, 
 all scala/java test cases passed, but pyspark test cases just crashed
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113965#comment-14113965
 ] 

Nan Zhu commented on SPARK-2855:


No.

https://github.com/apache/spark/pull/1313

Search for "This particular failure was my fault".

 pyspark test cases crashed for no reason
 

 Key: SPARK-2855
 URL: https://issues.apache.org/jira/browse/SPARK-2855
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Nan Zhu

 I met this for several times, 
 all scala/java test cases passed, but pyspark test cases just crashed
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2855.
---

Resolution: Fixed

That issue should be fixed now, so I'm going to mark this JIRA as resolved.  
Feel free to re-open (or open a new issue) if you notice flaky PySpark tests.

 pyspark test cases crashed for no reason
 

 Key: SPARK-2855
 URL: https://issues.apache.org/jira/browse/SPARK-2855
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Nan Zhu

 I met this for several times, 
 all scala/java test cases passed, but pyspark test cases just crashed
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-28 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: spark-1297-v5.txt

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Minor
 Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
 spark-1297-v5.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113977#comment-14113977
 ] 

Ted Yu commented on SPARK-1297:
---

Patch v5 is the aggregate of the 4 commits in the pull request.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Minor
 Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
 spark-1297-v5.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-3277:
---

Priority: Blocker  (was: Major)

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Fix For: 1.1.0


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-3277:
---

Affects Version/s: 1.2.0
   1.1.0

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Fix For: 1.1.0


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114014#comment-14114014
 ] 

Mridul Muralidharan commented on SPARK-3277:


[~matei] Attaching a patch which reproduces the bug consistently.
I suspect the issue is more serious than what I detailed above: spill to disk 
seems completely broken, if I understood the assertion message correctly.
Unfortunately, this is based on a few minutes of free time I could grab, so a 
more principled debugging session is definitely warranted!
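
For anyone following along, the reported setup amounts to roughly the sketch 
below (a plain wordcount with the LZ4 codec class and spilling left enabled; 
this is only an illustration, not the attached test_lz4_bug.patch):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Sketch of the reported scenario: LZ4 shuffle compression with spilling enabled.
object Lz4SpillSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("lz4-spill-sketch")
      .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
      .set("spark.shuffle.spill", "true") // the default; shown for emphasis
    val sc = new SparkContext(conf)
    // A wordcount over a large, high-cardinality text should force the aggregator to spill.
    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
    println(counts.count())
    sc.stop()
  }
}
{code}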



 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Fix For: 1.1.0


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114022#comment-14114022
 ] 

Mridul Muralidharan edited comment on SPARK-3277 at 8/28/14 5:37 PM:
-

The attached patch is against master; I noticed similar changes in 1.1 as well, 
but have not yet verified them.


was (Author: mridulm80):
Against master, though I noticed similar changes in 1.1 also : but not yet 
verified.

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Fix For: 1.1.0

 Attachments: test_lz4_bug.patch


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-3277:
---

Attachment: test_lz4_bug.patch

Against master; I noticed similar changes in 1.1 as well, but have not yet 
verified them.

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Fix For: 1.1.0

 Attachments: test_lz4_bug.patch


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114026#comment-14114026
 ] 

Mridul Muralidharan commented on SPARK-3277:


[~hzw] Did you notice this against 1.0.2?
I did not think the changes for consolidated shuffle were backported to that 
branch; [~mateiz] can comment more, though.

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Fix For: 1.1.0

 Attachments: test_lz4_bug.patch


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3150) NullPointerException in Spark recovery after simultaneous fall of master and driver

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3150.
---

   Resolution: Fixed
Fix Version/s: 1.0.3
   1.1.1

 NullPointerException in Spark recovery after simultaneous fall of master and 
 driver
 ---

 Key: SPARK-3150
 URL: https://issues.apache.org/jira/browse/SPARK-3150
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment:  Linux 3.2.0-23-generic x86_64
Reporter: Tatiana Borisova
 Fix For: 1.1.1, 1.0.3


 The issue happens when Spark is run standalone on a cluster.
 When the master and driver fail simultaneously on one node in the cluster, the 
 master tries to recover its state and restart the Spark driver.
 While restarting the driver, the master crashes with an NPE (stack trace below).
 After crashing, it restarts, tries to recover its state, and restarts the Spark 
 driver again, over and over in an infinite cycle.
 Specifically, Spark reads the DriverInfo state from ZooKeeper, but after 
 reading, DriverInfo.worker turns out to be null.
 Stack trace (on version 1.0.0, but reproducible on version 1.0.2, too):
 2014-08-14 21:44:59,519] ERROR  (akka.actor.OneForOneStrategy)
 java.lang.NullPointerException
 at 
 org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
 at 
 org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
 at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
 at 
 scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
 at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
 at 
 org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
 at 
 org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 How to reproduce: while running Spark standalone on a cluster, kill all Spark 
 processes on the node where the driver runs (i.e. kill the driver, master, and 
 worker simultaneously).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3277:
--

Fix Version/s: (was: 1.1.0)

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Attachments: test_lz4_bug.patch


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114055#comment-14114055
 ] 

Reynold Xin commented on SPARK-3280:


[~joshrosen] [~brkyvz] can you guys post the performance comparisons between 
sort vs hash shuffle in this ticket?

 Made sort-based shuffle the default implementation
 --

 Key: SPARK-3280
 URL: https://issues.apache.org/jira/browse/SPARK-3280
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 sort-based shuffle has lower memory usage and seems to outperform hash-based 
 in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3264) Allow users to set executor Spark home in Mesos

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-3264.
--

Resolution: Fixed

 Allow users to set executor Spark home in Mesos
 ---

 Key: SPARK-3264
 URL: https://issues.apache.org/jira/browse/SPARK-3264
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Andrew Or

 There is an existing way to do this, through spark.home. However, this is 
 neither documented nor intuitive. I propose that we add a more specific 
 config spark.mesos.executor.home for this purpose, and fall back to the 
 existing settings if it is not set.
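
A minimal sketch of the proposed fallback chain (assuming SparkConf.getOption; 
the SPARK_HOME environment fallback is illustrative rather than taken from the 
description):

{code}
import org.apache.spark.SparkConf

// Hedged sketch: prefer the new Mesos-specific key, then the legacy spark.home,
// then the executor's SPARK_HOME environment variable.
def executorSparkHome(conf: SparkConf): Option[String] =
  conf.getOption("spark.mesos.executor.home")
    .orElse(conf.getOption("spark.home"))
    .orElse(sys.env.get("SPARK_HOME"))
{code}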



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2608) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2608:
-

Fix Version/s: 1.1.0

 Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other 
 things)
 ---

 Key: SPARK-2608
 URL: https://issues.apache.org/jira/browse/SPARK-2608
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: wangfei
Priority: Blocker
 Fix For: 1.1.0


 The Mesos scheduler backend uses spark-class/spark-executor to launch the 
 executor backend, which leads to two problems:
 1. when spark.executor.extraJavaOptions is set, CoarseMesosSchedulerBackend 
 throws an error;
 2. spark.executor.extraJavaOptions and spark.executor.extraLibraryPath set in 
 SparkConf are not honored.
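
For concreteness, a sketch of how a user sets the two properties in question 
(the values are arbitrary examples):

{code}
import org.apache.spark.SparkConf

// The two settings the description says are mishandled on Mesos.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")
  .set("spark.executor.extraLibraryPath", "/opt/native/lib")
{code}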



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3264) Allow users to set executor Spark home in Mesos

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3264:
-

Fix Version/s: 1.1.0

 Allow users to set executor Spark home in Mesos
 ---

 Key: SPARK-3264
 URL: https://issues.apache.org/jira/browse/SPARK-3264
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.1.0


 There is an existing way to do this, through spark.home. However, this is 
 neither documented nor intuitive. I propose that we add a more specific 
 config spark.mesos.executor.home for this purpose, and fall back to the 
 existing settings if it is not set.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114092#comment-14114092
 ] 

Josh Rosen commented on SPARK-3280:
---

Here are some numbers from August 10.  If I recall, this was running on 8 
m3.8xlarge nodes.  This test linearly scales a bunch of parameters (data set 
size, numbers of mappers and reducers, etc).  You can see that hash-based 
shuffle's performance degrades severely in cases where we have many mappers and 
reducers, while sort scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!

!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 
-Dspark.serializer=org.apache.spark.serializer.JavaSerializer 
-Dspark.locality.wait=6000 
-Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 
--num-partitions=400 --reduce-tasks=400 --random-seed=5 
--persistent-type=memory  --num-records=2 --unique-keys=2 
--key-length=10 --unique-values=100 --value-length=10  
--storage-location=hdfs://:9000/spark-perf-kv-data
{code}

I'll try to run a better set of tests today.  I plan to look at a few cases 
that these tests didn't address, including the performance impact when running 
on spinning disks, as well as jobs where we have a large dataset with few 
mappers and reducers (I think this is the case that we'd expect to be most 
favorable to hash-based shuffle).
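
For anyone rerunning these numbers, the only knob that differs between the two 
sets of bars is the shuffle manager; a sketch of the two SparkConf variants, 
using the class names from the 1.1 line:

{code}
import org.apache.spark.SparkConf

// Hash-based baseline (matches the Java options in the config above).
val hashConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.hash.HashShuffleManager")

// Sort-based variant under comparison.
val sortConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.sort.SortShuffleManager")
{code}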

 Made sort-based shuffle the default implementation
 --

 Key: SPARK-3280
 URL: https://issues.apache.org/jira/browse/SPARK-3280
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 sort-based shuffle has lower memory usage and seems to outperform hash-based 
 in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114113#comment-14114113
 ] 

Ted Yu commented on SPARK-1297:
---

Here is a sample command for building against HBase 0.98:

mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 
-DskipTests clean package

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Minor
 Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
 spark-1297-v5.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-08-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114166#comment-14114166
 ] 

Joseph K. Bradley commented on SPARK-3272:
--

With respect to [SPARK-2207], I think this JIRA may or may not be necessary for 
implementing [SPARK-2207], depending on how the code is set up.  For 
[SPARK-2207], I imagined checking the number of instances and the information 
gain when the Node is constructed in the main loop (in the train() method).  If 
there are too few instances or too little information gain, then the Node will 
be set as a leaf.  We could potentially avoid the aggregation for those leaves, 
but I would consider that a separate issue ([SPARK-3158]).

 Calculate prediction for nodes separately from calculating information gain 
 for splits in decision tree
 ---

 Key: SPARK-3272
 URL: https://issues.apache.org/jira/browse/SPARK-3272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Qiping Li
 Fix For: 1.1.0


 In the current implementation, the prediction for a node is calculated along 
 with the information gain stats for each possible split, even though the value 
 to predict for a node is determined regardless of which split is chosen.
 To save computation, we can calculate the prediction first and then calculate 
 the information gain stats for each split.
 This is also necessary if we want to support a minimum-instances-per-node 
 parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
 because when no split satisfies the minimum-instances requirement we do not 
 use the information gain of any split, yet we still need a way to get the 
 prediction value.
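
The proposed reordering can be sketched roughly as follows (a toy version over 
label histograms with a generic impurity function; MLlib's actual aggregation 
is more involved):

{code}
// Illustrative only: compute the node's prediction from its label histogram
// first, then score each candidate split by information gain.
def predictThenGain(labelCounts: Array[Double],
                    splits: Seq[(Array[Double], Array[Double])],
                    impurity: Array[Double] => Double): (Int, Seq[Double]) = {
  // The prediction depends only on the node's own label distribution.
  val prediction = labelCounts.indices.maxBy(i => labelCounts(i))
  val total = labelCounts.sum
  val parentImpurity = impurity(labelCounts)
  // Information gain is computed afterwards, once per candidate split.
  val gains = splits.map { case (left, right) =>
    parentImpurity -
      (left.sum / total) * impurity(left) -
      (right.sum / total) * impurity(right)
  }
  (prediction, gains)
}
{code}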



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode

2014-08-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114204#comment-14114204
 ] 

Chris Fregly commented on SPARK-2475:
-

Another option for the examples, specifically, is to default the number of 
local threads similar to how the Kinesis example does it:

https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L104

I get the number of shards in the given Kinesis stream and add 1.  The goal was 
to make this example work out of the box with little friction; even an error 
message can be discouraging.

For the other examples, we could just default to 2.  The advanced user can 
override if they want, though I don't think I support an override in my 
Kinesis example.  Whoops!  :)
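
A sketch of that defaulting scheme (numReceivers stands in for the shard count 
the Kinesis example looks up at runtime):

{code}
import org.apache.spark.SparkConf

// One local thread per receiver plus at least one left over for processing.
val numReceivers = 2
val sparkConf = new SparkConf()
  .setMaster(s"local[${numReceivers + 1}]")
  .setAppName("StreamingExampleSketch")
{code}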

 Check whether #cores > #receivers in local mode
 ---

 Key: SPARK-2475
 URL: https://issues.apache.org/jira/browse/SPARK-2475
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das

 When the number of slots in local mode is not more than the number of 
 receivers, the system should throw an error; otherwise it just keeps waiting 
 for resources to process the received data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114247#comment-14114247
 ] 

Matei Zaharia commented on SPARK-3277:
--

Thanks Mridul -- I think Andrew and Patrick have figured this out.

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Attachments: test_lz4_bug.patch


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3287:


Description: 
When ResourceManager High Availability is enabled, there will be multiple 
resource managers and each of them could act as a proxy.
AmIpFilter is modified to accept multiple proxy hosts. But Spark 
ApplicationMaster fails to read the ResourceManager IPs properly from the 
configuration.

As a result, AmIpFilter is initialized with an empty set of proxy hosts, and any 
access to the ApplicationMaster WebUI is redirected to the RM port on the local 
host.


  was:
When ResourceManager High Availability is enabled, there will be multiple 
resource managers and each of them could act as a proxy.
AmIpFilter is modified to accept multiple proxy hosts. But Spark 
ApplicationMaster fails read the ResourceManager IPs properly from the 
configuration.

So AmIpFilter is initialized with an empty set of proxy hosts. So any access to 
the ApplicationMaster WebUI will be redirected to port RM port on the local 
host. 



 When ResourceManager High Availability is enabled, ApplicationMaster webUI is 
 not displayed.
 

 Key: SPARK-3287
 URL: https://issues.apache.org/jira/browse/SPARK-3287
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3287.patch


 When ResourceManager High Availability is enabled, there will be multiple 
 resource managers and each of them could act as a proxy.
 AmIpFilter is modified to accept multiple proxy hosts. But Spark 
 ApplicationMaster fails to read the ResourceManager IPs properly from the 
 configuration.
 As a result, AmIpFilter is initialized with an empty set of proxy hosts, and 
 any access to the ApplicationMaster WebUI is redirected to the RM port on the 
 local host.
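
A sketch of the lookup the ApplicationMaster presumably needs when HA is 
enabled (the configuration keys are the standard YARN ones; the wiring into 
AmIpFilter's proxy-host parameter is not shown):

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

// Collect every ResourceManager web address so the filter gets a non-empty
// proxy host list when ResourceManager HA is configured.
def rmProxyHosts(conf: YarnConfiguration): Seq[String] = {
  val rmIds = conf.getStringCollection("yarn.resourcemanager.ha.rm-ids").asScala.toSeq
  if (rmIds.isEmpty) Seq(conf.get("yarn.resourcemanager.webapp.address"))
  else rmIds.map(id => conf.get(s"yarn.resourcemanager.webapp.address.$id"))
}
{code}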



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2014-08-28 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3288:
--

 Summary: All fields in TaskMetrics should be private and use 
getters/setters
 Key: SPARK-3288
 URL: https://issues.apache.org/jira/browse/SPARK-3288
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or


This is particularly bad because we expose this as a developer API. Technically 
a library could create a TaskMetrics object and then change the values inside 
of it and pass it on to someone else. It can be written pretty compactly, like 
below:

{code}
  /**
   * Number of bytes written for the shuffle by this task
   */
  @volatile private var _shuffleBytesWritten: Long = _
  def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
  def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
  def shuffleBytesWritten = _shuffleBytesWritten
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Colin B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin B. updated SPARK-3266:


Attachment: spark-repro-3266.tar.gz

I have attached a simple Java project which reproduces the issue. 
[^spark-repro-3266.tar.gz]

{code}
 tar xvzf spark-repro-3266.tar.gz
...
 cd spark-repro-3266
 mvn clean package
 /path/to/spark-1.0.2-bin-hadoop2/bin/spark-submit --class SimpleApp 
 target/testcase-4-1.0.jar
...
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
at SimpleApp.main(SimpleApp.java:17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1
Reporter: Amey Chaugule
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-3266:
-

Assignee: Josh Rosen

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3281) Remove Netty specific code in BlockManager

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3281.


   Resolution: Fixed
Fix Version/s: 1.2.0

 Remove Netty specific code in BlockManager
 --

 Key: SPARK-3281
 URL: https://issues.apache.org/jira/browse/SPARK-3281
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.2.0


 Everything should go through the BlockTransferService interface rather than 
 having conditional branches for Netty.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3285.


   Resolution: Fixed
Fix Version/s: 1.2.0

 Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
 -

 Key: SPARK-3285
 URL: https://issues.apache.org/jira/browse/SPARK-3285
 Project: Spark
  Issue Type: Test
  Components: Examples
Affects Versions: 1.0.2
Reporter: Yadong Qi
 Fix For: 1.2.0


 def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
 Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), 
 so we'd better use values.sum instead of values.foldLeft(0)(_ + _).
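
A one-line check of the equivalence being relied on:

{code}
val values = Seq(1, 2, 3)
// sum is itself defined via foldLeft(num.zero)(num.plus), so the two agree.
assert(values.foldLeft(0)(_ + _) == values.sum)
{code}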



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3266:
--

Affects Version/s: 1.0.2

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114362#comment-14114362
 ] 

Josh Rosen commented on SPARK-3266:
---

Thanks for the reproduction!  I tried it myself and see the same issue.

If I replace 

{code}
JavaDoubleRDD javaDoubleRDD = sc.parallelizeDoubles(numbers);
{code}

with 

{code}
JavaRDDLike<Double, ?> javaDoubleRDD = sc.parallelizeDoubles(numbers);
{code}

then it seems to work.  I'll take a closer look using {{javap}} to see if I can 
figure out why this is happening.

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3183) Add option for requesting full YARN cluster

2014-08-28 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114368#comment-14114368
 ] 

Shay Rojansky commented on SPARK-3183:
--

+1.

As a current workaround for cores, we specify a number well beyond the YARN 
cluster capacity. This gets handled well by Spark/YARN, and we get the entire 
cluster.

 Add option for requesting full YARN cluster
 ---

 Key: SPARK-3183
 URL: https://issues.apache.org/jira/browse/SPARK-3183
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza

 This could possibly be in the form of --executor-cores ALL --executor-memory 
 ALL --num-executors ALL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Colin B. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114387#comment-14114387
 ] 

Colin B. commented on SPARK-3266:
-

So there is no method:
{code}
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
{code}
but there is a method:
{code}
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Object;
{code}

I've heard that the return type is part of the method signature in Java 
bytecode, so the two are different (one returns a Double, the other an Object).

This looks a bit like a Scala type-erasure issue. The Scala code generated for 
JavaRDDLike includes a max method that returns an Object. In JavaDoubleRDD the 
type is bound to Double, so Java code which calls max on JavaDoubleRDD expects 
a method returning Double. Since max is implemented in the JavaRDDLike trait, 
the Java code doesn't seem to inherit it correctly when types are involved.

I tested making JavaRDDLike an abstract class instead of a trait. It was able 
to compile and run correctly. However, it is not compatible with 1.0.2.

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114391#comment-14114391
 ] 

Sean Owen commented on SPARK-3266:
--

(Mea culpa! The example shows this is a legitimate question. I'll be quiet now.)

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Amey Chaugule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114413#comment-14114413
 ] 

Amey Chaugule commented on SPARK-3266:
--

No worries, I initially assumed my runtime env was old too, until I rechecked.

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2, 1.1.0
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3260) Yarn - pass acls along with executor launch

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114408#comment-14114408
 ] 

Apache Spark commented on SPARK-3260:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/2185

 Yarn - pass acls along with executor launch
 ---

 Key: SPARK-3260
 URL: https://issues.apache.org/jira/browse/SPARK-3260
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves

 In https://github.com/apache/spark/pull/1196 I added passing the Spark view 
 and modify ACLs into YARN.  Unfortunately we are only passing them into the 
 application master, and I missed passing them in when we launch individual 
 containers (executors). 
 We need to modify ExecutorRunnable.startContainer to set the ACLs in the 
 ContainerLaunchContext.
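
A hedged sketch of the change being described (the helper is illustrative; the 
actual patch modifies ExecutorRunnable.startContainer):

{code}
import org.apache.hadoop.yarn.api.records.{ApplicationAccessType, ContainerLaunchContext}
import scala.collection.JavaConverters._

// Attach the view/modify ACLs to the executor's launch context so YARN
// enforces them for the container, not just the application master.
def setContainerAcls(ctx: ContainerLaunchContext, viewAcls: String, modifyAcls: String): Unit = {
  ctx.setApplicationACLs(Map(
    ApplicationAccessType.VIEW_APP -> viewAcls,
    ApplicationAccessType.MODIFY_APP -> modifyAcls
  ).asJava)
}
{code}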



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3266:
--

Affects Version/s: 1.1.0

JavaRDDLike probably should be an abstract class.  I think the current trait 
implementation was a holdover from an earlier prototype that attempted to 
achieve higher code reuse for operations like map() and filter().

I added a test case to JavaAPISuite that reproduces this issue on master, too.

The simplest solution is probably to make JavaRDDLike into an abstract class.  I think we 
can do this while maintaining source compatibility.  A less invasive but 
messier solution would be to just copy the implementation of max() and min() 
into each Java*RDD class and remove it from the trait.  

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2, 1.1.0
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While my code compiles, when I try to execute it I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 Stepping into the JavaDoubleRDD class, I don't see max(), although it is 
 clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3277:
---

Assignee: Andrew Or

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Assignee: Andrew Or
Priority: Blocker
 Attachments: test_lz4_bug.patch


 I tested LZ4 compression and ran into this problem (with wordcount).
 I also tested Snappy and LZF, and they were fine.
 Setting spark.shuffle.spill to false avoids the exception, but as soon as 
 spilling is enabled again, the error returns.
 It seems that wordcount succeeds when the text has few distinct words, but 
 fails on a more complex text.
 Exception info as follows:
 {code}
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3277:
---

Description: 
I tested LZ4 compression and ran into this problem (with wordcount).
I also tested Snappy and LZF, and they were fine.
Setting spark.shuffle.spill to false avoids the exception, but as soon as 
spilling is enabled again, the error returns.
It seems that wordcount succeeds when the text has few distinct words, but 
fails on a more complex text.
Exception info as follows:
{code}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{code}


  was:
I tested LZ4 compression and ran into this problem (with wordcount).
I also tested Snappy and LZF, and they were fine.
Setting spark.shuffle.spill to false avoids the exception, but as soon as 
spilling is enabled again, the error returns.
It seems that wordcount succeeds when the text has few distinct words, but 
fails on a more complex text.
Exception info as follows:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)



 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Priority: Blocker
 Attachments: test_lz4_bug.patch


 I tested the LZ4 compression, and it came up with the following problem (with wordcount).
 I also tested Snappy and LZF, and they were OK.
 In the end I set spark.shuffle.spill to false to avoid the exception, 
 but once this switch is turned back on, the error comes back.
 It seems that if the number of words is small, wordcount will go through, but if 
 it is a complex text, this problem will show up.
 Exception info as follows:
 {code}
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 

[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: SPARK-3286.patch

Attaching the patch for the master

 Cannot view ApplicationMaster UI when Yarn’s url scheme is https
 

 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3286.patch, SPARK-3286.patch


 The Spark ApplicationMaster starts its web UI at http://host-name:port.
 When the Spark ApplicationMaster registers its URL with the Resource Manager, the 
 URL does not contain a URI scheme.
 If the URL scheme is absent, the Resource Manager’s web app proxy will use the 
 HTTP policy of the Resource Manager (YARN-1553).
 If the HTTP policy of the Resource Manager is https, then the web app proxy will 
 try to access https://host-name:port.
 This will result in an error.
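
As an editorial aside, the idea behind the attached patch can be sketched roughly as
follows, assuming the standard YARN key yarn.http.policy; this is illustrative only
and is not the code in SPARK-3286.patch:

{code}
import org.apache.hadoop.conf.Configuration

// Build the AM tracking URL with an explicit scheme so the RM web app proxy
// does not fall back to its own HTTP policy when proxying requests.
def amTrackingUrl(conf: Configuration, host: String, port: Int): String = {
  val httpsOnly = conf.get("yarn.http.policy", "HTTP_ONLY") == "HTTPS_ONLY"
  val scheme = if (httpsOnly) "https" else "http"
  s"$scheme://$host:$port"
}
{code}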



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: SPARK-3286-branch-1-0.patch

 Cannot view ApplicationMaster UI when Yarn’s url scheme is https
 

 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch


 The spark Application Master starts its web UI at http://host-name:port.
 When Spark ApplicationMaster registers its URL with Resource Manager , the 
 URL does not contain URI scheme.
 If the URL scheme is absent, Resource Manager’s web app proxy will use the 
 HTTP Policy of the Resource Manager.(YARN-1553)
 If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
 try to access https://host-name:port.
 This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: (was: SPARK-3286.patch)

 Cannot view ApplicationMaster UI when Yarn’s url scheme is https
 

 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch


 The spark Application Master starts its web UI at http://host-name:port.
 When Spark ApplicationMaster registers its URL with Resource Manager , the 
 URL does not contain URI scheme.
 If the URL scheme is absent, Resource Manager’s web app proxy will use the 
 HTTP Policy of the Resource Manager.(YARN-1553)
 If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
 try to access https://host-name:port.
 This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114453#comment-14114453
 ] 

Apache Spark commented on SPARK-3266:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2186

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2, 1.1.0
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While I can compile my code, I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
 don't notice max()
 although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3190) Creation of large graph( 2.15 B nodes) seems to be broken:possible overflow somewhere

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3190.
---

   Resolution: Fixed
Fix Version/s: 1.0.3
   1.1.1
   1.2.0

Issue resolved by pull request 2106
[https://github.com/apache/spark/pull/2106]

 Creation of large graph( 2.15 B nodes) seems to be broken:possible overflow 
 somewhere 
 ---

 Key: SPARK-3190
 URL: https://issues.apache.org/jira/browse/SPARK-3190
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.3
 Environment: Standalone mode running on EC2 . Using latest code from 
 master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
Reporter: npanj
Assignee: Ankur Dave
Priority: Critical
 Fix For: 1.2.0, 1.1.1, 1.0.3


 While creating a graph with 6B nodes and 12B edges, I noticed that the 
 'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
 number. A few times (with different datasets of more than 2.5B nodes) I have also 
 noticed that numVertices is returned as a negative number, so I suspect that there 
 is some overflow (maybe we are using Int for some field?).
 Here are some details of the experiments I have done so far: 
 1. Input: numNodes=6101995593 ; noEdges=12163784626
Graph returns: numVertices=1807028297 ;  numEdges=12163784626
 2. Input: numNodes=2157586441 ; noEdges=2747322705
Graph returns: numVertices=-2137380855 ;  numEdges=2747322705
 3. Input: numNodes=1725060105 ; noEdges=204176821
Graph returns: numVertices=1725060105 ;  numEdges=2041768213
 You can find the code to reproduce this bug here: 
 https://gist.github.com/npanj/92e949d86d08715bf4bf
 Note: Nodes are labeled 1...6B.
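
As an editorial aside, the reported values are exactly what 32-bit truncation of the
true counts would produce, which is consistent with the Int-overflow suspicion above
(illustration only; the actual fix is in the pull request referenced in this
resolution):

{code}
// Truncating the true Long counts to Int reproduces the reported numVertices:
val case1: Int = 6101995593L.toInt   // = 1807028297, the wrong count in case 1
val case2: Int = 2157586441L.toInt   // = -2137380855, the negative count in case 2
{code}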
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3289) Prevent complete job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3289:
-

 Summary: Prevent complete job failures due to rescheduling of 
failing tasks on buggy machines
 Key: SPARK-3289
 URL: https://issues.apache.org/jira/browse/SPARK-3289
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen


Some users have reported issues where a task fails due to an environment / 
configuration issue on some machine, then the task is reattempted _on that same 
buggy machine_ until the entire job fails because that single task has 
failed too many times.

To guard against this, maybe we should add some randomization in how we 
reschedule failed tasks.
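
One way to picture the proposal (an editorial sketch with hypothetical names, not
Spark's scheduler code):

{code}
import scala.util.Random

// Prefer executors that have not already failed this task; if every candidate
// has failed it, fall back to a random choice so one bad machine cannot
// monopolize all of the retries.
def pickExecutorForRetry(
    candidates: Seq[String],
    failedOn: Set[String],
    rng: Random = new Random()): Option[String] = {
  val healthy = candidates.filterNot(failedOn)
  val pool = if (healthy.nonEmpty) healthy else candidates
  if (pool.isEmpty) None else Some(pool(rng.nextInt(pool.size)))
}
{code}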



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3289:
--

Summary: Avoid job failures due to rescheduling of failing tasks on buggy 
machines  (was: Prevent complete job failures due to rescheduling of failing 
tasks on buggy machines)

 Avoid job failures due to rescheduling of failing tasks on buggy machines
 -

 Key: SPARK-3289
 URL: https://issues.apache.org/jira/browse/SPARK-3289
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen

 Some users have reported issues where a task fails due to an environment / 
 configuration issue on some machine, then the task is reattempted _on that 
 same buggy machine_ until the entire job fails because that single task 
 has failed too many times.
 To guard against this, maybe we should add some randomization in how we 
 reschedule failed tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114484#comment-14114484
 ] 

Mridul Muralidharan commented on SPARK-3277:


Sounds great, thanks!
I suspect it is because for LZO we configure it to write a block on flush 
(a partial block if there is insufficient data to fill one); but for LZ4, either 
such a config does not exist or we don't use it.
This results in flush becoming a no-op when the data in the current block is 
insufficient to cause a compressed block to be created, while close will force the 
partial block to be written out.

Which is why the assertion lists all sizes as 0.

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Assignee: Andrew Or
Priority: Blocker
 Attachments: test_lz4_bug.patch


 I tested the LZ4 compression, and it came up with the following problem (with wordcount).
 I also tested Snappy and LZF, and they were OK.
 In the end I set spark.shuffle.spill to false to avoid the exception, 
 but once this switch is turned back on, the error comes back.
 It seems that if the number of words is small, wordcount will go through, but if 
 it is a complex text, this problem will show up.
 Exception info as follows:
 {code}
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-08-28 Thread Qiping Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114591#comment-14114591
 ] 

Qiping Li commented on SPARK-3272:
--

Hi Joseph, thanks for your comment. I think checking the number of instances 
can't be done in the train() method because we don't know the number of 
instances for the leftSplit or rightSplit; for each split, we can only get 
information from InformationGainStats, which doesn't contain the number of 
instances. In my implementation of SPARK-2207, the check is done in 
calculateGainForSplit: when the check fails, an invalid information gain is 
returned, and the calculation of the predicted value may be skipped in that case. 

Maybe we can include the number of instances for leftSplit and rightSplit in 
the information gain stats and calculate the predicted value regardless of whether 
the check passes or not. Either is fine with me.
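
As an editorial sketch of the idea being discussed (the class and field names below
are hypothetical, not the MLlib API), carrying the instance counts in the gain stats
makes the minimum-instances check straightforward:

{code}
case class GainStatsWithCounts(
    gain: Double,
    impurity: Double,
    leftInstanceCount: Long,   // training instances routed to the left child
    rightInstanceCount: Long,  // training instances routed to the right child
    predict: Double)           // node prediction, computed regardless of the split

def satisfiesMinInstances(stats: GainStatsWithCounts, minInstancesPerNode: Long): Boolean =
  stats.leftInstanceCount >= minInstancesPerNode &&
    stats.rightInstanceCount >= minInstancesPerNode
{code}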

 Calculate prediction for nodes separately from calculating information gain 
 for splits in decision tree
 ---

 Key: SPARK-3272
 URL: https://issues.apache.org/jira/browse/SPARK-3272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Qiping Li
 Fix For: 1.1.0


 In the current implementation, the prediction for a node is calculated along with 
 the calculation of information gain stats for each possible split. The value to 
 predict for a specific node is determined no matter what the splits are.
 To save computation, we can calculate the prediction first and then 
 calculate the information gain stats for each split.
 This is also necessary if we want to support a minimum-instances-per-node 
 parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
 because when no split satisfies the minimum instances requirement, we 
 don't use the information gain of any split. There should still be a way to get 
 the prediction value.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114611#comment-14114611
 ] 

Mark Hamstra commented on SPARK-3289:
-

https://github.com/apache/spark/pull/1360

 Avoid job failures due to rescheduling of failing tasks on buggy machines
 -

 Key: SPARK-3289
 URL: https://issues.apache.org/jira/browse/SPARK-3289
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen

 Some users have reported issues where a task fails due to an environment / 
 configuration issue on some machine, then the task is reattempted _on that 
 same buggy machine_ until the entire job fails because that single task 
 has failed too many times.
 To guard against this, maybe we should add some randomization in how we 
 reschedule failed tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-28 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114612#comment-14114612
 ] 

Benoy Antony commented on SPARK-3287:
-

I'll submit a git pull request.

 When ResourceManager High Availability is enabled, ApplicationMaster webUI is 
 not displayed.
 

 Key: SPARK-3287
 URL: https://issues.apache.org/jira/browse/SPARK-3287
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3287.patch


 When ResourceManager High Availability is enabled, there will be multiple 
 resource managers and each of them could act as a proxy.
 AmIpFilter is modified to accept multiple proxy hosts. But Spark 
 ApplicationMaster fails to read the ResourceManager IPs properly from the 
 configuration.
 So AmIpFilter is initialized with an empty set of proxy hosts, and any access 
 to the ApplicationMaster web UI will be redirected to the RM port on the 
 local host. 
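
As an editorial aside, reading all of the proxy hosts under RM HA can be sketched
with the standard YARN configuration keys as below; this is illustrative only and is
not the code in SPARK-3287.patch:

{code}
import org.apache.hadoop.conf.Configuration

// When RM HA is enabled, every configured RM can act as the web proxy, so all
// of their web app addresses need to be handed to AmIpFilter.
def rmWebAppAddresses(conf: Configuration): Seq[String] = {
  if (conf.getBoolean("yarn.resourcemanager.ha.enabled", false)) {
    Option(conf.getStrings("yarn.resourcemanager.ha.rm-ids"))
      .getOrElse(Array.empty[String])
      .toSeq
      .map(id => conf.get(s"yarn.resourcemanager.webapp.address.$id"))
  } else {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  }
}
{code}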



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114614#comment-14114614
 ] 

Benoy Antony commented on SPARK-3286:
-

I'll submit a git pull request.

 Cannot view ApplicationMaster UI when Yarn’s url scheme is https
 

 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch


 The spark Application Master starts its web UI at http://host-name:port.
 When Spark ApplicationMaster registers its URL with Resource Manager , the 
 URL does not contain URI scheme.
 If the URL scheme is absent, Resource Manager’s web app proxy will use the 
 HTTP Policy of the Resource Manager.(YARN-1553)
 If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
 try to access https://host-name:port.
 This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3277.


Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/2187

Thanks to everyone who helped isolate and debug this.

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: hzw
Assignee: Andrew Or
Priority: Blocker
 Attachments: test_lz4_bug.patch


 I tested the LZ4 compression, and it came up with the following problem (with wordcount).
 I also tested Snappy and LZF, and they were OK.
 In the end I set spark.shuffle.spill to false to avoid the exception, 
 but once this switch is turned back on, the error comes back.
 It seems that if the number of words is small, wordcount will go through, but if 
 it is a complex text, this problem will show up.
 Exception info as follows:
 {code}
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-2970.
---

Resolution: Fixed

 spark-sql script ends with IOException when EventLogging is enabled
 ---

 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta
Priority: Critical
 Fix For: 1.1.0


 When the spark-sql script runs with spark.eventLog.enabled set to true, it ends 
 with an IOException because FileLogger cannot create the APPLICATION_COMPLETE 
 file in HDFS.
 This is because the shutdown hook of SparkSQLCLIDriver is executed after the 
 shutdown hook of org.apache.hadoop.fs.FileSystem.
 When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally 
 tries to create a file to mark the application as finished, but the FileSystem 
 hook has already closed the FileSystem.
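
As an editorial note, one way to avoid this kind of ordering race is to register the
event-log work with Hadoop's ShutdownHookManager at a higher priority than the
FileSystem hook. This is a sketch of that approach only, not necessarily the change
that resolved this issue:

{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Register the "mark application complete" work so it runs before the
// FileSystem shutdown hook closes the FileSystem (higher priority runs earlier).
def registerEventLogHook(markApplicationComplete: () => Unit): Unit = {
  ShutdownHookManager.get().addShutdownHook(
    new Runnable { override def run(): Unit = markApplicationComplete() },
    FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)
}
{code}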



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2961) Use statistics to skip partitions when reading from in-memory columnar data

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114668#comment-14114668
 ] 

Apache Spark commented on SPARK-2961:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2188

 Use statistics to skip partitions when reading from in-memory columnar data
 ---

 Key: SPARK-2961
 URL: https://issues.apache.org/jira/browse/SPARK-2961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114680#comment-14114680
 ] 

Patrick Wendell commented on SPARK-3266:


[~joshrosen] is there a solution here that preserves binary compatibility? 
That's been our goal at this point and we've maintained it by and large except 
for a few very minor mandatory Scala 2.11 upgrades.

 JavaDoubleRDD doesn't contain max()
 ---

 Key: SPARK-3266
 URL: https://issues.apache.org/jira/browse/SPARK-3266
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.1, 1.0.2, 1.1.0
Reporter: Amey Chaugule
Assignee: Josh Rosen
 Attachments: spark-repro-3266.tar.gz


 While I can compile my code, I see:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
 When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
 don't notice max()
 although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2636) Expose job ID in JobWaiter API

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2636:
---

Summary: Expose job ID in JobWaiter API  (was: no where to get job 
identifier while submit spark job through spark API)

 Expose job ID in JobWaiter API
 --

 Key: SPARK-2636
 URL: https://issues.apache.org/jira/browse/SPARK-2636
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
  Labels: hive

 In Hive on Spark, we want to track spark job status through Spark API, the 
 basic idea is as following:
 # create an hive-specified spark listener and register it to spark listener 
 bus.
 # hive-specified spark listener generate job status by spark listener events.
 # hive driver track job status through hive-specified spark listener. 
 the current problem is that hive driver need job identifier to track 
 specified job status through spark listener, but there is no spark API to get 
 job identifier(like job id) while submit spark job.
 I think other project whoever try to track job status with spark API would 
 suffer from this as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3200:
---

Component/s: Spark Shell

 Class defined with reference to external variables crashes in REPL.
 ---

 Key: SPARK-3200
 URL: https://issues.apache.org/jira/browse/SPARK-3200
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.1.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma

 Reproducer:
 {noformat}
 val a = sc.textFile("README.md").count
 case class A(i: Int) { val j = a} 
 sc.parallelize(1 to 10).map(A(_)).collect()
 {noformat}
 This happens when one refers to something that itself refers to sc, and not 
 otherwise. There are many ways to work around this, like directly assigning a 
 constant value instead of referring to the variable. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3245) spark insert into hbase class not serialize

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3245.


Resolution: Invalid

I'm closing this for now because we typically report only isolated issues on 
the JIRA. Feel free to ping the spark user list for help narrowing down the 
issue.
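
As an editorial note, a common pattern that avoids this kind of
NotSerializableException (a sketch under the assumption that an HTable/HTablePool
created on the driver is being captured in the closure; not a confirmed diagnosis of
this report) is to create the table inside foreachPartition:

{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.spark.rdd.RDD

// Create the HTable on the executor, inside foreachPartition, so that no
// non-serializable HBase client object from the driver is captured in the closure.
def writeCounts(result: RDD[(String, Int)], tableName: String): Unit = {
  result.foreachPartition { iter =>
    val table = new HTable(HBaseConfiguration.create(), tableName)
    iter.foreach { case (key, count) =>
      val put = new Put(java.util.UUID.randomUUID().toString.reverse.getBytes)
      // "lv6" column family taken from the snippet in the report
      put.add("lv6".getBytes, key.getBytes, count.toString.getBytes)
      table.put(put)
    }
    table.close()
  }
}
{code}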

 spark insert into hbase  class not serialize
 

 Key: SPARK-3245
 URL: https://issues.apache.org/jira/browse/SPARK-3245
 Project: Spark
  Issue Type: Bug
 Environment: spark-1.0.1 + hbase-0.96.2 + hadoop-2.2.0
Reporter: 刘勇

 val result: org.apache.spark.rdd.RDD[(String, Int)]
 result.foreach(res => {
   var put = new Put(java.util.UUID.randomUUID().toString.reverse.getBytes())
     .add("lv6".getBytes(), res._1.toString.getBytes(), res._2.toString.getBytes)
   table.put(put)
 })
 Exception in thread Thread-3 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task not serializable: java.io.NotSerializableException: 
 org.apache.hadoop.hbase.client.HTablePool$PooledHTable
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:901)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:898)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:898)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:898)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:897)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:897)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1226)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3234) SPARK_HADOOP_VERSION doesn't have a valid value by default in make-distribution.sh

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3234:
---

Target Version/s: 1.2.0

 SPARK_HADOOP_VERSION doesn't have a valid value by default in 
 make-distribution.sh 
 ---

 Key: SPARK-3234
 URL: https://issues.apache.org/jira/browse/SPARK-3234
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Minor

 {{SPARK_HADOOP_VERSION}} has already been deprecated, but 
 {{make-distribution.sh}} uses it as part of the distribution tarball name. As 
 a result, we end up with something like {{spark-1.1.0-SNAPSHOT-bin-.tgz}} 
 because {{SPARK_HADOOP_VERSION}} is empty.
 A possible fix is to add the antrun plugin into the Maven build and run Maven 
 to print {{$hadoop.version}}. Instructions can be found in [this 
 post|http://www.avajava.com/tutorials/lessons/how-do-i-display-the-value-of-a-property.html].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3277:
-

Affects Version/s: (was: 1.2.0)

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: hzw
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.1.0

 Attachments: test_lz4_bug.patch


 I tested the LZ4 compression, and it came up with the following problem (with wordcount).
 I also tested Snappy and LZF, and they were OK.
 In the end I set spark.shuffle.spill to false to avoid the exception, 
 but once this switch is turned back on, the error comes back.
 It seems that if the number of words is small, wordcount will go through, but if 
 it is a complex text, this problem will show up.
 Exception info as follows:
 {code}
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3277:
-

Fix Version/s: 1.1.0

 LZ4 compression cause the the ExternalSort exception
 

 Key: SPARK-3277
 URL: https://issues.apache.org/jira/browse/SPARK-3277
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: hzw
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.1.0

 Attachments: test_lz4_bug.patch


 I tested the LZ4 compression, and it came up with the following problem (with wordcount).
 I also tested Snappy and LZF, and they were OK.
 In the end I set spark.shuffle.spill to false to avoid the exception, 
 but once this switch is turned back on, the error comes back.
 It seems that if the number of words is small, wordcount will go through, but if 
 it is a complex text, this problem will show up.
 Exception info as follows:
 {code}
 java.lang.AssertionError: assertion failed
 at scala.Predef$.assert(Predef.scala:165)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
 at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
 at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3288:
-

Affects Version/s: 1.1.0

 All fields in TaskMetrics should be private and use getters/setters
 ---

 Key: SPARK-3288
 URL: https://issues.apache.org/jira/browse/SPARK-3288
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Assignee: Andrew Or

 This is particularly bad because we expose this as a developer API. 
 Technically a library could create a TaskMetrics object and then change the 
 values inside of it and pass it onto someone else. It can be written pretty 
 compactly like below:
 {code}
 /**
  * Number of bytes written for the shuffle by this task
  */
 @volatile private var _shuffleBytesWritten: Long = _
 def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
 def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
 def shuffleBytesWritten = _shuffleBytesWritten
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...

2014-08-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114729#comment-14114729
 ] 

Michael Armbrust commented on SPARK-2594:
-

It's a lot of overhead to assign issues to people, but feel free to work on this 
now that you have posted here.  Please post a design here before you begin 
coding.

 Add CACHE TABLE name AS SELECT ...
 

 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-08-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114742#comment-14114742
 ] 

Joseph K. Bradley commented on SPARK-3272:
--

Hi Qiping, you are right; I missed that!  I like your idea of storing the 
number of instances in the InformationGainStats.  (That seems easier to 
understand than a special invalid gain value.)  For now, I would recommend 
storing the number for the node, not for the left and right child nodes.  That 
would allow you to decide if the node being considered is a leaf (not its 
children).

I agree that, eventually, we should identify whether the children are leaves at 
the same time.  That should be part of [SPARK-3158], which could modify 
findBestSplits to return ImpurityCalculators (a new class from my PR 
[https://github.com/apache/spark/pull/2125]) for the left and right child 
nodes.  Does that sound reasonable?

 Calculate prediction for nodes separately from calculating information gain 
 for splits in decision tree
 ---

 Key: SPARK-3272
 URL: https://issues.apache.org/jira/browse/SPARK-3272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Qiping Li
 Fix For: 1.1.0


 In the current implementation, the prediction for a node is calculated along with 
 the calculation of information gain stats for each possible split. The value to 
 predict for a specific node is determined no matter what the splits are.
 To save computation, we can calculate the prediction first and then 
 calculate the information gain stats for each split.
 This is also necessary if we want to support a minimum-instances-per-node 
 parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
 because when no split satisfies the minimum instances requirement, we 
 don't use the information gain of any split. There should still be a way to get 
 the prediction value.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3291) TestcaseName in createQueryTest should not contain :

2014-08-28 Thread Qiping Li (JIRA)
Qiping Li created SPARK-3291:


 Summary: TestcaseName in createQueryTest should not contain :
 Key: SPARK-3291
 URL: https://issues.apache.org/jira/browse/SPARK-3291
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Qiping Li


":" is not allowed to appear in a file name on a Windows system. If a file name 
contains ":", the file can't be checked out on a Windows system, and developers 
using Windows must be careful not to commit the deletion of such files, which 
is very inconvenient. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-08-28 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114789#comment-14114789
 ] 

Erik Erlandson commented on SPARK-3250:
---

I did some experiments with sampling that models the gaps between samples (so 
one can use iterator.drop between samples).  The results are here:

https://gist.github.com/erikerlandson/66b42d96500589f25553

There appears to be a crossover point in efficiency, around sampling 
probability p=0.3, where densities below 0.3 are best done using the new logic, 
and higher sampling densities are better done using traditional filter-based 
logic.

I need to run more tests, but the first results are promising.  At low sampling 
densities the improvement is large.
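
For context, the gap-sampling idea can be sketched as below (an editorial
illustration only, not the code behind the gist above): instead of testing every
element against the sampling probability p, draw the size of the gap to the next
sampled element from a geometric distribution and skip ahead with iterator.drop,
which is cheap when p is small.

{code}
import scala.util.Random

// Sample each element with probability p (0 < p < 1) by drawing geometric gaps
// between accepted elements instead of flipping a coin per element.
def gapSample[T](data: Iterator[T], p: Double, rng: Random = new Random()): Iterator[T] =
  new Iterator[T] {
    // number of skipped elements before the next accepted one, K ~ Geometric(p)
    private def nextGap(): Int =
      (math.log(1.0 - rng.nextDouble()) / math.log(1.0 - p)).toInt
    private var it = data.drop(nextGap())
    override def hasNext: Boolean = it.hasNext
    override def next(): T = {
      val sampled = it.next()
      it = it.drop(nextGap())
      sampled
    }
  }
{code}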

 More Efficient Sampling
 ---

 Key: SPARK-3250
 URL: https://issues.apache.org/jira/browse/SPARK-3250
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: RJ Nowling

 Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
 number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
 sampling, where k is the number of data points to sample.  Examples of such 
 algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
 Descent with mini batching.
 More efficient sampling may be achievable by packing partitions with an 
 ArrayBuffer or other data structure supporting random access.  Since many of 
 these stochastic algorithms perform repeated rounds of sampling, it may be 
 feasible to perform a transformation to change the backing data structure 
 followed by multiple rounds of sampling.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3292) Shuffle Tasks run indefinitely even though there's no inputs

2014-08-28 Thread guowei (JIRA)
guowei created SPARK-3292:
-

 Summary: Shuffle Tasks run indefinitely even though there's no 
inputs
 Key: SPARK-3292
 URL: https://issues.apache.org/jira/browse/SPARK-3292
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: guowei


This affects shuffle-based operations such as repartition, groupBy, join and cogroup.
It is too expensive; for example, if I want to save the outputs as Hadoop files, then 
many empty files are generated.
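
As an editorial note, a user-side workaround for the empty-output symptom can be
sketched as below (it does not address the underlying scheduling behaviour):

{code}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Skip writing a batch whose RDD produced no records, so empty part files
// are not created for input-less intervals.
def saveNonEmptyBatches[T](stream: DStream[T], outputDir: String): Unit =
  stream.foreachRDD { (rdd, time: Time) =>
    if (rdd.take(1).nonEmpty) {
      rdd.saveAsTextFile(s"$outputDir/batch-${time.milliseconds}")
    }
  }
{code}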



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


