[jira] [Commented] (SPARK-2982) Glitch of spark streaming
[ https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093816#comment-14093816 ] dai zhiyuan commented on SPARK-2982: [~srowen] Please see the attached files. > Glitch of spark streaming > - > > Key: SPARK-2982 > URL: https://issues.apache.org/jira/browse/SPARK-2982 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: dai zhiyuan > Attachments: cpu.png, io.png, network.png > > > Spark Streaming task start times are tightly clustered. This causes spikes (glitches) in network and CPU usage, while the CPU and network sit idle much of the rest of the time, which wastes system resources. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2981) PartitionStrategy: VertexID hash overflow
[ https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093813#comment-14093813 ] Apache Spark commented on SPARK-2981: - User 'larryxiao' has created a pull request for this issue: https://github.com/apache/spark/pull/1902 > PartitionStrategy: VertexID hash overflow > - > > Key: SPARK-2981 > URL: https://issues.apache.org/jira/browse/SPARK-2981 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > In EdgePartition1D, a PartitionID is calculated by multiplying the VertexId by a mixingPrime (1125899906842597L), casting to Int, and taking the result mod numParts. > The Long product overflows when cast to Int: > {quote} > scala> (1125899906842597L*1).toInt > res1: Int = -27 > scala> (1125899906842597L*2).toInt > res2: Int = -54 > scala> (1125899906842597L*3).toInt > res3: Int = -81 > {quote} > Because the cast produces values that are all multiples of 3, the partitioning is unusable when the number of partitions is a multiple of 3. > For example, when partitioning into 6 or 9 parts: > {quote} > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > so the vertices end up only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts > {quote} > I think the solution is to cast after the mod. > {quote} > scala> (1125899906842597L*3) > res4: Long = 3377699720527791 > scala> (1125899906842597L*3) % 9 > res5: Long = 3 > scala> ((1125899906842597L*3) % 9).toInt > res5: Int = 3 > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow
[ https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-2981: -- Description: In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} was: In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} > PartitionStrategy: VertexID hash overflow > - > > Key: SPARK-2981 > URL: https://issues.apache.org/jira/browse/SPARK-2981 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with > a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. 
> The Long is overflowed, and when cast to Int: > {quote} > scala> (1125899906842597L*1).toInt > res1: Int = -27 > scala> (1125899906842597L*2).toInt > res2: Int = -54 > scala> (1125899906842597L*3).toInt > res3: Int = -81 > {quote} > As the cast produce number that are multiplies of 3, the partition is not > useable when partitioning to multiples of 3. > for example when you partition to 6 or 9 parts: > {quote} > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > so the vertices are partitioned to 0,3 for 6; and 0 for 9 > {quote} > I think solution is to cast after mod. > {quote} > scala> (1125899906842597L*3) > res4: Long = 3377699720527791 > scala> (1125899906842597L*3) % 9 > res5: Long = 3 > scala> ((1125899906842597L*3) % 9).toInt > res5: Int = 3 > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
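The overflow and the proposed fix are easy to reproduce outside GraphX. The sketch below is illustrative only: it is not the actual EdgePartition1D code, and the function names and use of math.abs are assumptions. It shows the cast-then-mod behaviour described above collapsing vertices into a few partitions, and the mod-then-cast variant spreading them again.
{noformat}
// Hypothetical standalone sketch of the bug described above, not the real
// EdgePartition1D implementation. overflowingPartition casts to Int before
// the mod (current behaviour); fixedPartition takes the mod on the Long
// first and casts afterwards (the proposed fix).
object PartitionHashSketch {
  val mixingPrime: Long = 1125899906842597L

  def overflowingPartition(vid: Long, numParts: Int): Int =
    math.abs((vid * mixingPrime).toInt) % numParts

  def fixedPartition(vid: Long, numParts: Int): Int =
    (math.abs(vid * mixingPrime) % numParts).toInt

  def main(args: Array[String]): Unit = {
    for (vid <- 1L to 4L) {
      // With 9 parts, the overflowing version maps every vertex to partition 0,
      // matching the skewed psrc/pdst counts in the log above; the fixed
      // version spreads the vertices across partitions.
      println(s"vid=$vid overflow=${overflowingPartition(vid, 9)} fixed=${fixedPartition(vid, 9)}")
    }
  }
}
{noformat}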
[jira] [Updated] (SPARK-2982) Glitch of spark streaming
[ https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dai zhiyuan updated SPARK-2982: --- Attachment: network.png io.png cpu.png > Glitch of spark streaming > - > > Key: SPARK-2982 > URL: https://issues.apache.org/jira/browse/SPARK-2982 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: dai zhiyuan > Attachments: cpu.png, io.png, network.png > > > Spark Streaming task start times are tightly clustered. This causes spikes (glitches) in network and CPU usage, while the CPU and network sit idle much of the rest of the time, which wastes system resources. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2982) Glitch of spark streaming
[ https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dai zhiyuan updated SPARK-2982: --- Description: spark streaming task startup time is very focused,It creates a problem which is glitch of (network and cpu) , and cpu and network is in an idle state at lot of time,which is wasteful for system resources. (was: spark streaming task startup time is very focused,It creates a problem which is network and cpu glitch, and cpu and network is in an idle state at lot of time,which is very wasteful for system resources.) > Glitch of spark streaming > - > > Key: SPARK-2982 > URL: https://issues.apache.org/jira/browse/SPARK-2982 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: dai zhiyuan > > spark streaming task startup time is very focused,It creates a problem which > is glitch of (network and cpu) , and cpu and network is in an idle state at > lot of time,which is wasteful for system resources. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2982) Glitch of spark streaming
[ https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093794#comment-14093794 ] Sean Owen commented on SPARK-2982: -- I find it hard to understand the problem or solution that this is attempting to describe. Could you please provide much clearer detail? > Glitch of spark streaming > - > > Key: SPARK-2982 > URL: https://issues.apache.org/jira/browse/SPARK-2982 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: dai zhiyuan > > Spark Streaming task start times are tightly clustered. This causes network and CPU spikes (glitches), while the CPU and network sit idle much of the rest of the time, which is very wasteful of system resources. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2985) BlockGenerator not available.
dai zhiyuan created SPARK-2985: -- Summary: BlockGenerator not available. Key: SPARK-2985 URL: https://issues.apache.org/jira/browse/SPARK-2985 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: dai zhiyuan Priority: Critical If the ReceiverTracker crashes, the buffered data in BlockGenerator will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2650) Caching tables larger than memory causes OOMs
[ https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093785#comment-14093785 ] Apache Spark commented on SPARK-2650: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/1901 > Caching tables larger than memory causes OOMs > - > > Key: SPARK-2650 > URL: https://issues.apache.org/jira/browse/SPARK-2650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0, 1.0.1 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.1.0 > > > The logic for setting up the initial column buffers is different for Spark > SQL compared to Shark and I'm seeing OOMs when caching tables that are larger > than available memory (where shark was okay). > Two suspicious things: the intialSize is always set to 0 so we always go with > the default. The default looks like it was copied from code like 10 * 1024 * > 1024... but in Spark SQL its 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
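For reference, here is the arithmetic behind the suspicion in the description; the constants are quoted from the report, and the res names are just REPL output.
{noformat}
scala> 10 * 102 * 1024     // the value described as being in Spark SQL: roughly 1 MB
res0: Int = 1044480

scala> 10 * 1024 * 1024    // the value apparently intended (as in Shark): 10 MB
res1: Int = 10485760
{noformat}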
[jira] [Updated] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2984: -- Description: We've seen several stacktraces and threads on the user mailing list where people are having issues with a FileNotFoundException stemming from an HDFS path containing _temporary. I think this may be related to spark.speculation. I think the error condition might manifest in this circumstance: 1) task T starts on a executor E1 2) it takes a long time, so task T' is started on another executor E2 3) T finishes in E1 so moves its data from _temporary to the final destination and deletes the _temporary directory during cleanup 4) T' finishes in E2 and attempts to move its data from _temporary, but those files no longer exist! exception Some samples: {noformat} 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0 java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} -- Chen Song at http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html {noformat} I am running a Spark Streaming job that uses saveAsTextFiles to save results into hdfs files. However, it has an exception after 20 batches result-140631234/_temporary/0/task_201407251119__m_03 does not exist. 
{noformat} and {noformat} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.ja
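If the race really is triggered by speculative duplicates of the output tasks, one possible mitigation while the committer behaviour is sorted out is to keep speculation off for jobs that write to HDFS. A minimal sketch follows; the application name is hypothetical, and spark.speculation already defaults to false, so it is set explicitly here only for emphasis.
{noformat}
// Illustrative mitigation only, not a fix for the committer race itself.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("streaming-hdfs-writer")   // hypothetical application name
  .set("spark.speculation", "false")     // avoid duplicate attempts racing on _temporary

val sc = new SparkContext(conf)
{noformat}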
[jira] [Created] (SPARK-2984) FileNotFoundException on _temporary directory
Andrew Ash created SPARK-2984: - Summary: FileNotFoundException on _temporary directory Key: SPARK-2984 URL: https://issues.apache.org/jira/browse/SPARK-2984 Project: Spark Issue Type: Bug Reporter: Andrew Ash Priority: Critical We've seen several stacktraces and threads on the user mailing list where people are having issues with a FileNotFoundException stemming from an HDFS path containing _temporary. I think this may be related to spark.speculation. I think the error condition might manifest in this circumstance: 1) task T starts on a executor E1 2) it takes a long time, so task T' is started on another executor E2 3) T finishes in E1 so moves its data from _temporary to the final destination and deletes the _temporary directory during cleanup 4) T' finishes in E2 and attempts to move its data from _temporary, but those files no longer exist! exception Some samples: {noformat} 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0 java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} -- Chen Song at http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html {noformat} I am running a Spark Streaming job that uses saveAsTextFiles to save results into hdfs files. 
However, it has an exception after 20 batches result-140631234/_temporary/0/task_201407251119__m_03 does not exist. {noformat} and {noformat} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.a
[jira] [Created] (SPARK-2983) improve performance of sortByKey()
Davies Liu created SPARK-2983: - Summary: improve performance of sortByKey() Key: SPARK-2983 URL: https://issues.apache.org/jira/browse/SPARK-2983 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.2, 0.9.0, 1.1.0 Reporter: Davies Liu For large datasets with many partitions (N), sortByKey() will be very slow, because it takes O(N) time in rangePartitioner. This could be improved by using binary search, reducing the time to O(log N). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
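The ticket targets the Python rangePartitioner, but the idea is language-independent. The Scala sketch below (function name and bounds invented for illustration) shows the O(log N) lookup over the sorted partition bounds that would replace the linear scan.
{noformat}
// Hypothetical sketch: find the target partition for a key by binary search
// over the sorted upper bounds, instead of scanning them one by one.
def partitionForKey(key: Int, bounds: Array[Int]): Int = {
  var lo = 0
  var hi = bounds.length
  while (lo < hi) {
    val mid = (lo + hi) >>> 1
    if (bounds(mid) < key) lo = mid + 1 else hi = mid
  }
  lo  // index of the first bound >= key, i.e. the partition to use
}

// Example: bounds sampled from the data, defining 4 partitions.
val bounds = Array(10, 20, 30)
Seq(5, 15, 25, 99).map(partitionForKey(_, bounds))  // List(0, 1, 2, 3)
{noformat}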
[jira] [Resolved] (SPARK-2923) Implement some basic linalg operations in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2923. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1849 [https://github.com/apache/spark/pull/1849] > Implement some basic linalg operations in MLlib > --- > > Key: SPARK-2923 > URL: https://issues.apache.org/jira/browse/SPARK-2923 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.1.0 > > > We use breeze for linear algebra operations. Breeze operations are > user-friendly but there are some concerns: > 1. creating temp objects, e.g., `val z = a * x + b * y` > 2. multi-method is not used in some operators, e.g., `axpy`. If we pass in > SparseVector as a generic Vector, it will use activeIterator, which is slow > 3. calling native BLAS if it is available, which might not be good for > level-1 methods > Having some basic BLAS operations implemented in MLlib can help simplify the > current implementation and improve some performance. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
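As a rough illustration of the kind of helper the description has in mind, here is a hedged sketch of an axpy (y += a * x) that dispatches on a dense versus sparse representation directly, rather than iterating through a generic interface. Plain arrays stand in for the MLlib vector classes and the function names are invented.
{noformat}
// Sketch only: not the MLlib API, just the dispatch idea from the description.
def axpyDense(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  var i = 0
  while (i < x.length) { y(i) += a * x(i); i += 1 }
}

// Sparse x given as (indices, values): touch only the non-zero entries,
// avoiding a generic active iterator over the whole vector.
def axpySparse(a: Double, indices: Array[Int], values: Array[Double],
               y: Array[Double]): Unit = {
  var k = 0
  while (k < indices.length) { y(indices(k)) += a * values(k); k += 1 }
}
{noformat}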
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093746#comment-14093746 ] Jianshi Huang commented on SPARK-2890: -- My use case: The result will be parsed into (id, type, start, end, properties) tuples. Properties might or might not contain any of (id, type, start end). So it's easier just to list them at the end and not to worry about duplicated names. Jianshi > Spark SQL should allow SELECT with duplicated columns > - > > Key: SPARK-2890 > URL: https://issues.apache.org/jira/browse/SPARK-2890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Jianshi Huang > > Spark reported error java.lang.IllegalArgumentException with messages: > java.lang.IllegalArgumentException: requirement failed: Found fields with the > same name. > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.sql.catalyst.types.StructType.(dataTypes.scala:317) > at > org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) > at > org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) > at > org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) > at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) > After trial and error, it seems it's caused by duplicated columns in my > select clause. > I made the duplication on purpose for my code to parse correctly. I think we > should allow users to specify duplicated columns as return value. > Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2982) Glitch of spark streaming
dai zhiyuan created SPARK-2982: -- Summary: Glitch of spark streaming Key: SPARK-2982 URL: https://issues.apache.org/jira/browse/SPARK-2982 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: dai zhiyuan Spark Streaming task start times are tightly clustered. This causes network and CPU spikes (glitches), while the CPU and network sit idle much of the rest of the time, which is very wasteful of system resources. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer
[ https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2934. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1862 [https://github.com/apache/spark/pull/1862] > Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer > -- > > Key: SPARK-2934 > URL: https://issues.apache.org/jira/browse/SPARK-2934 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2826) Reduce the Memory Copy for HashOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2826. - Resolution: Fixed Fix Version/s: 1.1.0 > Reduce the Memory Copy for HashOuterJoin > > > Key: SPARK-2826 > URL: https://issues.apache.org/jira/browse/SPARK-2826 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.1.0 > > > This is actually a follow up for > https://issues.apache.org/jira/browse/SPARK-2212 , the previous > implementation has potential memory copy. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow
[ https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-2981: -- Description: In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} was: In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} > PartitionStrategy: VertexID hash overflow > - > > Key: SPARK-2981 > URL: https://issues.apache.org/jira/browse/SPARK-2981 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > In PartitionStrategy.scala a PartitionID is calculated by multiplying > VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod > numParts. > The Long is overflowed, and when cast to Int: > {quote} > scala> (1125899906842597L*1).toInt > res1: Int = -27 > scala> (1125899906842597L*2).toInt > res2: Int = -54 > scala> (1125899906842597L*3).toInt > res3: Int = -81 > {quote} > As the cast produce number that are multiplies of 3, the partition is not > useable when partitioning to multiples of 3. 
> for example when you partition to 6 or 9 parts: > {quote} > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > so the vertices are partitioned to 0,3 for 6; and 0 for 9 > {quote} > I think solution is to cast after mod. > {quote} > scala> (1125899906842597L*3) > res4: Long = 3377699720527791 > scala> (1125899906842597L*3) % 9 > res5: Long = 3 > scala> ((1125899906842597L*3) % 9).toInt > res5: Int = 3 > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2981) PartitionStrategy: VertexID hash overflow
Larry Xiao created SPARK-2981: - Summary: PartitionStrategy: VertexID hash overflow Key: SPARK-2981 URL: https://issues.apache.org/jira/browse/SPARK-2981 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2650) Caching tables larger than memory causes OOMs
[ https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2650. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Michael Armbrust (was: Cheng Lian) Target Version/s: 1.1.0 (was: 1.2.0) > Caching tables larger than memory causes OOMs > - > > Key: SPARK-2650 > URL: https://issues.apache.org/jira/browse/SPARK-2650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0, 1.0.1 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.1.0 > > > The logic for setting up the initial column buffers is different for Spark > SQL compared to Shark and I'm seeing OOMs when caching tables that are larger > than available memory (where shark was okay). > Two suspicious things: the intialSize is always set to 0 so we always go with > the default. The default looks like it was copied from code like 10 * 1024 * > 1024... but in Spark SQL its 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2968) Fix nullabilities of Explode.
[ https://issues.apache.org/jira/browse/SPARK-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2968. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin > Fix nullabilities of Explode. > - > > Key: SPARK-2968 > URL: https://issues.apache.org/jira/browse/SPARK-2968 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0 > > > Output nullabilities of {{Explode}} could be detemined by > {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2965) Fix HashOuterJoin output nullabilities.
[ https://issues.apache.org/jira/browse/SPARK-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2965. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin > Fix HashOuterJoin output nullabilities. > --- > > Key: SPARK-2965 > URL: https://issues.apache.org/jira/browse/SPARK-2965 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0 > > > Output attributes of opposite side of {{OuterJoin}} should be nullable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2590) Add config property to disable incremental collection used in Thrift server
[ https://issues.apache.org/jira/browse/SPARK-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2590. - Resolution: Fixed Fix Version/s: 1.1.0 > Add config property to disable incremental collection used in Thrift server > --- > > Key: SPARK-2590 > URL: https://issues.apache.org/jira/browse/SPARK-2590 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.1.0 > > > {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the > result set one partition at a time. This is useful to avoid OOM when the > result is large, but introduces extra job scheduling costs as each partition > is collected with a separate job. Users may want to disable this when the > result set is expected to be small. > *UPDATE* Incremental collection hurts performance because tasks of the last > stage of the RDD DAG generated from the SQL query plan are executed > sequentially. Thus we decided to disable it by default. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
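The trade-off behind the proposed flag can be seen with the two existing collection paths. A minimal sketch, assuming an existing SparkContext named sc:
{noformat}
// toLocalIterator schedules roughly one job per partition (bounded driver
// memory, more scheduling overhead); collect() fetches the whole result in a
// single job. The proposed property would choose between these behaviours in
// the Thrift server.
val rdd = sc.parallelize(1 to 1000000, 100)

val incremental: Iterator[Int] = rdd.toLocalIterator  // one partition at a time
val allAtOnce: Array[Int] = rdd.collect()             // everything at once
{noformat}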
[jira] [Resolved] (SPARK-2844) Existing JVM Hive Context not correctly used in Python Hive Context
[ https://issues.apache.org/jira/browse/SPARK-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2844. - Resolution: Fixed Fix Version/s: 1.1.0 > Existing JVM Hive Context not correctly used in Python Hive Context > --- > > Key: SPARK-2844 > URL: https://issues.apache.org/jira/browse/SPARK-2844 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Ahir Reddy >Assignee: Ahir Reddy > Fix For: 1.1.0 > > > Unlike the SQLContext, passing an existing JVM HiveContext object into the > Python HiveContext constructor does not actually re-use that object. Instead > it creates a new HiveContext. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer
[ https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2934: - Assignee: DB Tsai > Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer > -- > > Key: SPARK-2934 > URL: https://issues.apache.org/jira/browse/SPARK-2934 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2980) Python support for chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2980: - Assignee: (was: Doris Xin) > Python support for chi-squared test > --- > > Key: SPARK-2980 > URL: https://issues.apache.org/jira/browse/SPARK-2980 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Doris Xin > -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2980) Python support for chi-squared test
Doris Xin created SPARK-2980: Summary: Python support for chi-squared test Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2515) Chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2515: - Summary: Chi-squared test (was: Hypothesis testing) > Chi-squared test > > > Key: SPARK-2515 > URL: https://issues.apache.org/jira/browse/SPARK-2515 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Doris Xin > Fix For: 1.1.0 > > > Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2515) Chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2515. Resolution: Implemented Target Version/s: 1.1.0 > Chi-squared test > > > Key: SPARK-2515 > URL: https://issues.apache.org/jira/browse/SPARK-2515 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Doris Xin > Fix For: 1.1.0 > > > Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2515) Hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2515: - Fix Version/s: 1.1.0 > Hypothesis testing > -- > > Key: SPARK-2515 > URL: https://issues.apache.org/jira/browse/SPARK-2515 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Doris Xin > Fix For: 1.1.0 > > > Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS
[ https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-2979: --- Summary: Improve the convergence rate by minimizing the condition number in LOR with LBFGS (was: Improve the convergence rate by minimize the condition number in LOR with LBFGS) > Improve the convergence rate by minimizing the condition number in LOR with > LBFGS > - > > Key: SPARK-2979 > URL: https://issues.apache.org/jira/browse/SPARK-2979 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai > > Scaling to minimize the condition number: > > During the optimization process, the convergence (rate) depends on the > condition number of the training dataset. Scaling the variables often reduces > this condition number, thus mproving the convergence rate dramatically. > Without reducing the condition number, some training datasets mixing the > columns with different scales may not be able to converge. > > GLMNET and LIBSVM packages perform the scaling to reduce the condition > number, and return the weights in the original scale. > See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf > > Here, if useFeatureScaling is enabled, we will standardize the training > features by dividing the variance of each column (without subtracting the > mean), and train the model in the scaled space. Then we transform the > coefficients from the scaled space to the original scale as GLMNET and LIBSVM > do. > > Currently, it's only enabled in LogisticRegressionWithLBFGS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS
[ https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093604#comment-14093604 ] Apache Spark commented on SPARK-2979: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/1897 > Improve the convergence rate by minimize the condition number in LOR with > LBFGS > --- > > Key: SPARK-2979 > URL: https://issues.apache.org/jira/browse/SPARK-2979 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai > > Scaling to minimize the condition number: > > During the optimization process, the convergence (rate) depends on the > condition number of the training dataset. Scaling the variables often reduces > this condition number, thus mproving the convergence rate dramatically. > Without reducing the condition number, some training datasets mixing the > columns with different scales may not be able to converge. > > GLMNET and LIBSVM packages perform the scaling to reduce the condition > number, and return the weights in the original scale. > See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf > > Here, if useFeatureScaling is enabled, we will standardize the training > features by dividing the variance of each column (without subtracting the > mean), and train the model in the scaled space. Then we transform the > coefficients from the scaled space to the original scale as GLMNET and LIBSVM > do. > > Currently, it's only enabled in LogisticRegressionWithLBFGS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS
DB Tsai created SPARK-2979: -- Summary: Improve the convergence rate by minimize the condition number in LOR with LBFGS Key: SPARK-2979 URL: https://issues.apache.org/jira/browse/SPARK-2979 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Scaling to minimize the condition number: During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets that mix columns with different scales may not be able to converge. GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf Here, if useFeatureScaling is enabled, we will standardize the training features by dividing by the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space to the original scale as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
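A hedged sketch of the scaling step the description outlines, on plain arrays rather than MLlib types; the function names are invented and the standard deviation is used as the per-column scale factor. Each column is divided by its scale, training happens in the scaled space, and the learned weights are mapped back by the same factors.
{noformat}
// Illustrative only; not the MLlib implementation.
def columnStds(rows: Array[Array[Double]]): Array[Double] = {
  val n = rows.length.toDouble
  val dims = rows.head.length
  val sums = new Array[Double](dims)
  val sqSums = new Array[Double](dims)
  for (row <- rows; j <- 0 until dims) {
    sums(j) += row(j)
    sqSums(j) += row(j) * row(j)
  }
  // population variance = E[x^2] - E[x]^2; floor it to avoid division by zero
  Array.tabulate(dims) { j =>
    math.sqrt(math.max(sqSums(j) / n - math.pow(sums(j) / n, 2), 1e-12))
  }
}

// Train on rows.map(scaleRow(_, std)), then map the weights back:
// x_scaled = x / s implies w_original = w_scaled / s.
def scaleRow(row: Array[Double], std: Array[Double]): Array[Double] =
  row.indices.map(j => row(j) / std(j)).toArray

def unscaleWeights(wScaled: Array[Double], std: Array[Double]): Array[Double] =
  wScaled.indices.map(j => wScaled(j) / std(j)).toArray
{noformat}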
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-2978: -- Description: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" * Allow groupByKey to take an ordering param for keys within a partition was: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" * Allow groupByKey to take an ordering param for keys within a partition > Provide an MR-style shuffle transformation > -- > > Key: SPARK-2978 > URL: https://issues.apache.org/jira/browse/SPARK-2978 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Sandy Ryza > > For Hive on Spark joins in particular, and for running legacy MR code in > general, I think it would be useful to provide a transformation with the > semantics of the Hadoop MR shuffle, i.e. one that > * groups by key: provides (Key, Iterator[Value]) > * within each partition, provides keys in sorted order > A couple ways that could make sense to expose this: > * Add a new operator. "groupAndSortByKey", > "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" > * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-2978: -- Description: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe? * Allow groupByKey to take an ordering param for keys within a partition was: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" * Allow groupByKey to take an ordering param for keys within a partition > Provide an MR-style shuffle transformation > -- > > Key: SPARK-2978 > URL: https://issues.apache.org/jira/browse/SPARK-2978 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Sandy Ryza > > For Hive on Spark joins in particular, and for running legacy MR code in > general, I think it would be useful to provide a transformation with the > semantics of the Hadoop MR shuffle, i.e. one that > * groups by key: provides (Key, Iterator[Value]) > * within each partition, provides keys in sorted order > A couple ways that could make sense to expose this: > * Add a new operator. "groupAndSortByKey", > "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe? > * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-2978: -- Description: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" * Allow groupByKey to take an ordering param for keys within a partition was: For Hive on Spark in particular, and running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" * Allow groupByKey to take an ordering param for keys within a partition > Provide an MR-style shuffle transformation > -- > > Key: SPARK-2978 > URL: https://issues.apache.org/jira/browse/SPARK-2978 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Sandy Ryza > > For Hive on Spark joins in particular, and for running legacy MR code in > general, I think it would be useful to provide an MR-style shuffle > transformation, i.e. one that > * groups by key: provides (Key, Iterator[Value]) > * within each partition, provides keys in sorted order > A couple ways that could make sense to expose this: > * Add a new operator. "groupAndSortByKey", > "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" > * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2978) Provide an MR-style shuffle transformation
Sandy Ryza created SPARK-2978: - Summary: Provide an MR-style shuffle transformation Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark in particular, and running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle" * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
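For comparison, the proposed contract can be approximated today with the existing pair-RDD operators. The sketch below (function name borrowed from the candidates listed above, Int keys for simplicity) hash-partitions by key, then sorts and groups each partition locally. It buffers a whole partition in memory, which is exactly the limitation a first-class operator could avoid.
{noformat}
// Hedged sketch of the semantics, not the proposed operator itself.
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

def groupByKeyAndSortWithinPartition[V: ClassTag](
    rdd: RDD[(Int, V)], numParts: Int): RDD[(Int, Iterable[V])] = {
  rdd.partitionBy(new HashPartitioner(numParts))
    .mapPartitions({ iter =>
      iter.toSeq
        .groupBy(_._1)                  // group the (key, value) pairs per key
        .toSeq
        .sortBy(_._1)                   // keys in sorted order within the partition
        .map { case (k, kvs) => (k, kvs.map(_._2): Iterable[V]) }
        .iterator
    }, preservesPartitioning = true)
}
{noformat}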
[jira] [Updated] (SPARK-2975) SPARK_LOCAL_DIRS may cause problems when running in local mode
[ https://issues.apache.org/jira/browse/SPARK-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2975: -- Priority: Critical (was: Minor) I'm raising the priority of this issue to 'critical', since it causes problems when running on a cluster if some tasks are small enough to be run locally on the driver. Here's an example exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 0.0 failed 1 times, most recent failure: Lost task 21.0 in stage 0.0 (TID 21, localhost): java.io.IOException: No such file or directory java.io.UnixFileSystem.createFileExclusively(Native Method) java.io.File.createNewFile(File.java:1006) java.io.File.createTempFile(File.java:1989) org.apache.spark.util.Utils$.fetchFile(Utils.scala:335) org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:342) org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:340) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) scala.collection.mutable.HashMap.foreach(HashMap.scala:98) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:340) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} > SPARK_LOCAL_DIRS may cause problems when running in local mode > -- > > Key: SPARK-2975 > URL: https://issues.apache.org/jira/browse/SPARK-2975 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Josh Rosen >Priority: Critical > > If we're running Spark in local mode and {{SPARK_LOCAL_DIRS}} is set, the > {{Executor}} modifies SparkConf so that this value overrides > {{spark.local.dir}}. Normally, this is safe because the modification takes > place before SparkEnv is created. In local mode, the Executor uses an > existing SparkEnv rather than creating a new one, so it winds up with a > DiskBlockManager that created local directories with the original > {{spark.local.dir}} setting, but other components attempt to use directories > specified in the _new_ {{
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093468#comment-14093468 ] Sean Owen commented on SPARK-1297: -- Yes I think you'd need to reflect that in changes to the build instructions. They are under docs/ > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093466#comment-14093466 ] Ted Yu commented on SPARK-1297: --- w.r.t. the build, by default, hbase-hadoop1 would be used. If a user specifies any of the hadoop-2 profiles, the hbase-hadoop2 profile should be specified as well. > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093413#comment-14093413 ] Vlad Frolov commented on SPARK-1065: I am facing the same issue in my project, where I use PySpark. As a proof of that the big objects I have could easily fit into nodes' memory, I am going to use dummy solution of saving my big objects into HDFS and load them on Python nodes. Does anybody have an idea how to fix the issue in a better way? I don't have enough either Scala nor Java knowledge to fix this in Spark core. However, I feel like broadcast variables could be reimplemented on Python side though it seems a bit dangerous idea because we don't want to have separate implementations of one thing in both languages. That will also save memory, because while we use broadcasts through Scala we have 1 copy in JVM, 1 pickled copy in Python and 1 constructed object copy in Python. > PySpark runs out of memory with large broadcast variables > - > > Key: SPARK-1065 > URL: https://issues.apache.org/jira/browse/SPARK-1065 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.7.3, 0.8.1, 0.9.0 >Reporter: Josh Rosen > > PySpark's driver components may run out of memory when broadcasting large > variables (say 1 gigabyte). > Because PySpark's broadcast is implemented on top of Java Spark's broadcast > by broadcasting a pickled Python as a byte array, we may be retaining > multiple copies of the large object: a pickled copy in the JVM and a > deserialized copy in the Python driver. > The problem could also be due to memory requirements during pickling. > PySpark is also affected by broadcast variables not being garbage collected. > Adding an unpersist() method to broadcast variables may fix this: > https://github.com/apache/incubator-spark/pull/543. > As a first step to fixing this, we should write a failing test to reproduce > the error. > This was discovered by [~sandy]: ["trouble with broadcast variables on > pyspark"|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
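A minimal sketch of the shared-storage workaround described in the comment above, assuming the pickle file path is reachable from every worker node (for example an HDFS or NFS mount); the path, payload, and helper name are placeholders and an existing SparkContext sc is assumed, so this is not code from the ticket.
{code}
import cPickle as pickle

big_object = {"some": "large", "lookup": "table"}   # placeholder payload
shared_path = "/mnt/shared/big_object.pkl"          # assumed visible to all workers

with open(shared_path, "wb") as f:
    pickle.dump(big_object, f, protocol=2)

def process_partition(records):
    # Each partition loads the object itself instead of receiving a broadcast,
    # avoiding the extra pickled copy held on the driver side.
    with open(shared_path, "rb") as f:
        table = pickle.load(f)
    for r in records:
        yield (r, table.get(r))

pairs = sc.parallelize(["some", "lookup", "missing"]).mapPartitions(process_partition).collect()
{code}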
[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093295#comment-14093295 ] Apache Spark commented on SPARK-2931: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/1896 > getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException > --- > > Key: SPARK-2931 > URL: https://issues.apache.org/jira/browse/SPARK-2931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf > benchmark >Reporter: Josh Rosen >Priority: Blocker > Attachments: scala-sort-by-key.err, test.patch > > > When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, > I get the following errors (one per task): > {code} > 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage > 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 > bytes) > 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] > with ID 0 > 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) > at > org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} > This causes the job to hang. > I can deterministically reproduce this by re-running the test, either in > isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2891) Daemon failed to launch worker
[ https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093265#comment-14093265 ] Davies Liu edited comment on SPARK-2891 at 8/11/14 8:45 PM: duplicated to 2898 https://issues.apache.org/jira/browse/SPARK-2898 was (Author: davies): duplicated to 2898 > Daemon failed to launch worker > -- > > Key: SPARK-2891 > URL: https://issues.apache.org/jira/browse/SPARK-2891 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Davies Liu >Priority: Critical > Fix For: 1.1.0 > > > daviesliu@dm:~/work/spark-perf$ /Users/daviesliu/work/spark/bin/spark-submit > --master spark://dm:7077 pyspark-tests/tests.py SchedulerThroughputTest > --num-tasks=1 --num-trials=4 --inter-trial-wait=1 > 14/08/06 17:58:04 WARN JettyUtils: Failed to create UI on port 4040. Trying > again on port 4041. - Failure(java.net.BindException: Address already in use) > Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily > unavailable > 14/08/06 17:59:25 ERROR Executor: Exception in task 9777.0 in stage 1.0 (TID > 19777) > java.lang.IllegalStateException: Python daemon failed to launch worker > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) > at > org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily > unavailable > 14/08/06 17:59:25 ERROR Executor: Exception in task 9781.0 in stage 1.0 (TID > 19781) > java.lang.IllegalStateException: Python daemon failed to launch worker > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) > at > org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at 
java.lang.Thread.run(Thread.java:745) > 14/08/06 17:59:25 WARN TaskSetManager: Lost task 9777.0 in stage 1.0 (TID > 19777, localhost): java.lang.IllegalStateException: Python daemon failed to > launch worker > > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) > > org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) > > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) > > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) > org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) > org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) >
[jira] [Resolved] (SPARK-2891) Daemon failed to launch worker
[ https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-2891. --- Resolution: Duplicate Fix Version/s: 1.1.0 duplicated to 2898 > Daemon failed to launch worker > -- > > Key: SPARK-2891 > URL: https://issues.apache.org/jira/browse/SPARK-2891 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Davies Liu >Priority: Critical > Fix For: 1.1.0 > > > daviesliu@dm:~/work/spark-perf$ /Users/daviesliu/work/spark/bin/spark-submit > --master spark://dm:7077 pyspark-tests/tests.py SchedulerThroughputTest > --num-tasks=1 --num-trials=4 --inter-trial-wait=1 > 14/08/06 17:58:04 WARN JettyUtils: Failed to create UI on port 4040. Trying > again on port 4041. - Failure(java.net.BindException: Address already in use) > Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily > unavailable > 14/08/06 17:59:25 ERROR Executor: Exception in task 9777.0 in stage 1.0 (TID > 19777) > java.lang.IllegalStateException: Python daemon failed to launch worker > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) > at > org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily > unavailable > 14/08/06 17:59:25 ERROR Executor: Exception in task 9781.0 in stage 1.0 (TID > 19781) > java.lang.IllegalStateException: Python daemon failed to launch worker > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) > at > org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/08/06 17:59:25 WARN TaskSetManager: Lost task 9777.0 in stage 1.0 (TID > 19777, 
localhost): java.lang.IllegalStateException: Python daemon failed to > launch worker > > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) > > org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) > > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) > > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) > org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) > org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > > java.util.concurrent.ThreadPoolExecutor.runWorker(Thread
[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093219#comment-14093219 ] Jim Blomo commented on SPARK-1284: -- I will try to reproduce on the 1.1 branch later this week, thanks for the update! > pyspark hangs after IOError on Executor > --- > > Key: SPARK-1284 > URL: https://issues.apache.org/jira/browse/SPARK-1284 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Jim Blomo >Assignee: Davies Liu > > When running a reduceByKey over a cached RDD, Python fails with an exception, > but the failure is not detected by the task runner. Spark and the pyspark > shell hang waiting for the task to finish. > The error is: > {code} > PySpark worker failed with exception: > Traceback (most recent call last): > File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in > dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in > dump_stream > self._write_with_length(obj, stream) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in > _write_with_length > stream.write(serialized) > IOError: [Errno 104] Connection reset by peer > 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as > 4257 bytes in 47 ms > Traceback (most recent call last): > File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in > launch_worker > worker(listen_sock) > File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker > outfile.flush() > IOError: [Errno 32] Broken pipe > {code} > I can reproduce the error by running take(10) on the cached RDD before > running reduceByKey (which looks at the whole input file). > Affects Version 1.0.0-SNAPSHOT (4d88030486) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
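The reproduction described above, sketched as a PySpark session; the input path and key extraction are placeholders, and whether it actually hangs depends on data size and Spark version, as discussed in the comments.
{code}
# Hypothetical reproduction following the description above.
rdd = sc.textFile("hdfs:///data/large_input.txt") \
        .map(lambda line: (line.split()[0], 1)) \
        .cache()

rdd.take(10)                                    # partial evaluation of the cached RDD
counts = rdd.reduceByKey(lambda a, b: a + b)    # full scan; this is where the hang was reported
print(counts.count())
{code}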
[jira] [Updated] (SPARK-2420) Dependency changes for compatibility with Hive
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brock Noland updated SPARK-2420: Labels: Hive (was: ) > Dependency changes for compatibility with Hive > -- > > Key: SPARK-2420 > URL: https://issues.apache.org/jira/browse/SPARK-2420 > Project: Spark > Issue Type: Wish > Components: Build >Affects Versions: 1.0.0 >Reporter: Xuefu Zhang > Labels: Hive > Attachments: spark_1.0.0.patch > > > During the prototyping of HIVE-7292, many library conflicts showed up because > Spark build contains versions of libraries that's vastly different from > current major Hadoop version. It would be nice if we can choose versions > that's in line with Hadoop or shading them in the assembly. Here are the wish > list: > 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 > 2. Shading Spark's jetty and servlet dependency in the assembly. > 3. guava version difference. Spark is using a higher version. I'm not sure > what's the best solution for this. > The list may grow as HIVE-7292 proceeds. > For information only, the attached is a patch that we applied on Spark in > order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2976) There are too many tabs in some source files
[ https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093175#comment-14093175 ] Apache Spark commented on SPARK-2976: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/1895 > There are too many tabs in some source files > > > Key: SPARK-2976 > URL: https://issues.apache.org/jira/browse/SPARK-2976 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > Currently, there are too many tabs in some source files, which does not conform > to the coding style. > I saw the following 3 files have tabs. > * sorttable.js > * JavaPageRank.java > * JavaKinesisWordCountASL.java -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2101) Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()
[ https://issues.apache.org/jira/browse/SPARK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2101. --- Resolution: Fixed Fix Version/s: 1.1.0 > Python unit tests fail on Python 2.6 because of lack of unittest.skipIf() > - > > Key: SPARK-2101 > URL: https://issues.apache.org/jira/browse/SPARK-2101 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Uri Laserson >Assignee: Josh Rosen > Fix For: 1.1.0 > > > PySpark tests fail with Python 2.6 because they currently depend on > {{unittest.skipIf}}, which was only introduced in Python 2.7. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
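One generic way to cope with the missing decorator on Python 2.6 is a small fallback shim; this is an illustrative sketch only (the class and test names are placeholders), not the fix that was merged for this ticket.
{code}
import sys

try:
    from unittest import skipIf                # available on Python 2.7+
except ImportError:
    def skipIf(condition, reason):             # crude Python 2.6 fallback
        def decorator(func):
            if condition:
                def skipped(*args, **kwargs):  # replace the test with a no-op
                    return None
                return skipped
            return func
        return decorator

class ExampleTests(object):
    @skipIf(sys.version_info[:2] < (2, 7), "needs Python 2.7")
    def test_something(self):
        assert True
{code}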
[jira] [Created] (SPARK-2977) Fix handling of short shuffle manager names in ShuffleBlockManager
Josh Rosen created SPARK-2977: - Summary: Fix handling of short shuffle manager names in ShuffleBlockManager Key: SPARK-2977 URL: https://issues.apache.org/jira/browse/SPARK-2977 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Josh Rosen Since we allow short names for {{spark.shuffle.manager}}, all code that reads that configuration property should be prepared to handle the short names. See my comment at https://github.com/apache/spark/pull/1799#discussion_r16029607 (opening this as a JIRA so we don't forget to fix it). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
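The underlying idea is a simple alias lookup before any comparison against the configured value. A language-neutral sketch follows (in Python for brevity; Spark's real resolution lives in its Scala code, and the fully qualified class names below are assumptions about the defaults rather than quoted from this ticket).
{code}
# Illustrative sketch only: resolve short aliases for spark.shuffle.manager
# before comparing, so code never sees the short form.
SHORT_NAMES = {
    "hash": "org.apache.spark.shuffle.hash.HashShuffleManager",   # assumed class name
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",   # assumed class name
}

def resolve_shuffle_manager(configured):
    return SHORT_NAMES.get(configured.lower(), configured)

assert resolve_shuffle_manager("SORT") == SHORT_NAMES["sort"]
{code}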
[jira] [Resolved] (SPARK-2910) Test with Python 2.6 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2910. --- > Test with Python 2.6 on Jenkins > --- > > Key: SPARK-2910 > URL: https://issues.apache.org/jira/browse/SPARK-2910 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.1.0 > > > As long as we continue to support Python 2.6 in PySpark, Jenkins should test > with Python 2.6. > We could downgrade the system Python to 2.6, but it might be easier / cleaner > to install 2.6 alongside the current Python and {{export > PYSPARK_PYTHON=python2.6}} in the test runner script. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2954. --- Resolution: Fixed Fix Version/s: 1.1.0 > PySpark MLlib serialization tests fail on Python 2.6 > > > Key: SPARK-2954 > URL: https://issues.apache.org/jira/browse/SPARK-2954 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.1.0 > > > The PySpark MLlib tests currently fail on Python 2.6 due to problems > unpacking data from bytearray using struct.unpack: > {code} > ** > File "pyspark/mllib/_common.py", line 181, in __main__._deserialize_double > Failed example: > _deserialize_double(_serialize_double(1L)) == 1.0 > Exception raised: > Traceback (most recent call last): > File > "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1253, in __run > compileflags, 1) in test.globs > File "", line 1, in > _deserialize_double(_serialize_double(1L)) == 1.0 > File "pyspark/mllib/_common.py", line 194, in _deserialize_double > return struct.unpack("d", ba[offset:])[0] > error: unpack requires a string argument of length 8 > ** > File "pyspark/mllib/_common.py", line 184, in __main__._deserialize_double > Failed example: > _deserialize_double(_serialize_double(sys.float_info.max)) == x > Exception raised: > Traceback (most recent call last): > File > "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1253, in __run > compileflags, 1) in test.globs > File "", line 1, in > _deserialize_double(_serialize_double(sys.float_info.max)) == x > File "pyspark/mllib/_common.py", line 194, in _deserialize_double > return struct.unpack("d", ba[offset:])[0] > error: unpack requires a string argument of length 8 > ** > File "pyspark/mllib/_common.py", line 187, in __main__._deserialize_double > Failed example: > _deserialize_double(_serialize_double(sys.float_info.max)) == y > Exception raised: > Traceback (most recent call last): > File > "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1253, in __run > compileflags, 1) in test.globs > File "", line 1, in > _deserialize_double(_serialize_double(sys.float_info.max)) == y > File "pyspark/mllib/_common.py", line 194, in _deserialize_double > return struct.unpack("d", ba[offset:])[0] > error: unpack requires a string argument of length 8 > ** > {code} > It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: > http://stackoverflow.com/a/15467046/590203 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
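A small standalone check of the buffer() workaround referenced at the end of the description; this is independent of MLlib's actual serializers and only demonstrates the unpacking pattern.
{code}
import struct

ba = bytearray(struct.pack("d", 1.0))
# On Python 2.6, unpacking directly from a bytearray slice raises
# "unpack requires a string argument of length 8"; wrapping in buffer() avoids it.
value = struct.unpack("d", buffer(ba, 0, 8))[0]
assert value == 1.0
{code}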
[jira] [Resolved] (SPARK-2948) PySpark doesn't work on Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2948. --- Resolution: Fixed Fix Version/s: 1.1.0 > PySpark doesn't work on Python 2.6 > -- > > Key: SPARK-2948 > URL: https://issues.apache.org/jira/browse/SPARK-2948 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: CentOS 6.5 / Python 2.6.6 >Reporter: Kousuke Saruta >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.1.0 > > > In serializser.py, collections.namedtuple is redefined as follows. > {code} > def namedtuple(name, fields, verbose=False, rename=False): > > > cls = _old_namedtuple(name, fields, verbose, rename) > > > return _hack_namedtuple(cls) > > > > {code} > The number of arguments is 4 but the number of arguments of namedtuple for > Python 2.6 is 3 so mismatch is occurred. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
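A version-aware variant of the redefinition quoted above, shown only as a sketch: Python 2.6's namedtuple has no rename parameter, so the wrapper forwards only the arguments the running interpreter supports. The real fix in serializers.py may differ.
{code}
import collections
import sys

_old_namedtuple = collections.namedtuple

def namedtuple(name, fields, verbose=False, rename=False):
    if sys.version_info[:2] <= (2, 6):
        cls = _old_namedtuple(name, fields, verbose)          # 2.6 signature: 3 args
    else:
        cls = _old_namedtuple(name, fields, verbose, rename)  # 2.7+ signature: 4 args
    return cls   # PySpark's serializer additionally wraps this with _hack_namedtuple
{code}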
[jira] [Commented] (SPARK-2700) Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile
[ https://issues.apache.org/jira/browse/SPARK-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093150#comment-14093150 ] Yin Huai commented on SPARK-2700: - Can we resolve it? > Hidden files (such as .impala_insert_staging) should be filtered out by > sqlContext.parquetFile > -- > > Key: SPARK-2700 > URL: https://issues.apache.org/jira/browse/SPARK-2700 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.1 >Reporter: Teng Qiu > Fix For: 1.1.0 > > > when creating a table in impala, a hidden folder .impala_insert_staging will > be created in the folder of table. > if we want to load such a table using Spark SQL API sqlContext.parquetFile, > this hidden folder makes trouble, spark try to get metadata from this folder, > you will see the exception: > {code:borderStyle=solid} > Caused by: java.io.IOException: Could not read footer for file > FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging; > isDirectory=true; modification_time=1406333729252; access_time=0; > owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false} > ... > ... > Caused by: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is > not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging > {code} > and impala side do not think this is their problem: > https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete > .impala_insert_staging directory after INSERT) > so maybe we should filter out these hidden folder/file by reading parquet > tables -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
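The proposed change amounts to skipping hidden children before collecting Parquet footers. A generic sketch of that check follows (the helper and paths are illustrative, not Spark's actual Scala implementation).
{code}
def visible_children(paths):
    # Drop entries whose final path component marks them as hidden metadata.
    def is_hidden(p):
        return p.rstrip("/").rsplit("/", 1)[-1].startswith(".")
    return [p for p in paths if not is_hidden(p)]

children = ["/user/hive/warehouse/parquet_strings/part-00000.parquet",
            "/user/hive/warehouse/parquet_strings/.impala_insert_staging"]
print(visible_children(children))   # only the data file is kept
{code}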
[jira] [Commented] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers
[ https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093133#comment-14093133 ] Apache Spark commented on SPARK-2790: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1894 > PySpark zip() doesn't work properly if RDDs have different serializers > -- > > Key: SPARK-2790 > URL: https://issues.apache.org/jira/browse/SPARK-2790 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0, 1.1.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > > In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have > different serializers (e.g. batched vs. unbatched), even if those RDDs have > the same number of partitions and same numbers of elements. This problem > occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of > LabelledPoints with a JavaRDD of batch-serialized Python objects. > This is problematic because whether zip() succeeds or errors depends on the > partitioning / batching strategy, and we don't want to surface the > serialization details to users. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093137#comment-14093137 ] Davies Liu commented on SPARK-1284: --- [~jblomo], could you reproduce this on master or the 1.1 branch? Maybe the pyspark did not hang after this error message; the take() had finished successfully before the error message popped up. The noisy error messages had been fixed in PR https://github.com/apache/spark/pull/1625 > pyspark hangs after IOError on Executor > --- > > Key: SPARK-1284 > URL: https://issues.apache.org/jira/browse/SPARK-1284 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Jim Blomo >Assignee: Davies Liu > > When running a reduceByKey over a cached RDD, Python fails with an exception, > but the failure is not detected by the task runner. Spark and the pyspark > shell hang waiting for the task to finish. > The error is: > {code} > PySpark worker failed with exception: > Traceback (most recent call last): > File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in > dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in > dump_stream > self._write_with_length(obj, stream) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in > _write_with_length > stream.write(serialized) > IOError: [Errno 104] Connection reset by peer > 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as > 4257 bytes in 47 ms > Traceback (most recent call last): > File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in > launch_worker > worker(listen_sock) > File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker > outfile.flush() > IOError: [Errno 32] Broken pipe > {code} > I can reproduce the error by running take(10) on the cached RDD before > running reduceByKey (which looks at the whole input file). > Affects Version 1.0.0-SNAPSHOT (4d88030486) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093119#comment-14093119 ] Yin Huai commented on SPARK-2890: - What is the semantic when you have columns with same names? > Spark SQL should allow SELECT with duplicated columns > - > > Key: SPARK-2890 > URL: https://issues.apache.org/jira/browse/SPARK-2890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Jianshi Huang > > Spark reported error java.lang.IllegalArgumentException with messages: > java.lang.IllegalArgumentException: requirement failed: Found fields with the > same name. > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.sql.catalyst.types.StructType.(dataTypes.scala:317) > at > org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) > at > org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) > at > org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) > at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) > After trial and error, it seems it's caused by duplicated columns in my > select clause. > I made the duplication on purpose for my code to parse correctly. I think we > should allow users to specify duplicated columns as return value. > Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2976) There are too many tabs in some source files
Kousuke Saruta created SPARK-2976: - Summary: There are too many tabs in some source files Key: SPARK-2976 URL: https://issues.apache.org/jira/browse/SPARK-2976 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor Currently, there are too many tabs in some source files, which does not conform to the coding style. I saw the following 3 files have tabs. * sorttable.js * JavaPageRank.java * JavaKinesisWordCountASL.java -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2963: -- Summary: The description about building to use HiveServer and CLI is incomplete (was: The description about building to use HiveServer and CLI is imcomplete) > The description about building to use HiveServer and CLI is incomplete > -- > > Key: SPARK-2963 > URL: https://issues.apache.org/jira/browse/SPARK-2963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta > > Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use the > -Phive-thriftserver option when building, but its description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2931: --- Fix Version/s: (was: 1.1.0) > getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException > --- > > Key: SPARK-2931 > URL: https://issues.apache.org/jira/browse/SPARK-2931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf > benchmark >Reporter: Josh Rosen >Priority: Blocker > Attachments: scala-sort-by-key.err, test.patch > > > When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, > I get the following errors (one per task): > {code} > 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage > 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 > bytes) > 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] > with ID 0 > 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) > at > org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} > This causes the job to hang. 
> I can deterministically reproduce this by re-running the test, either in > isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2931: --- Target Version/s: 1.1.0 > getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException > --- > > Key: SPARK-2931 > URL: https://issues.apache.org/jira/browse/SPARK-2931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf > benchmark >Reporter: Josh Rosen >Priority: Blocker > Attachments: scala-sort-by-key.err, test.patch > > > When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, > I get the following errors (one per task): > {code} > 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage > 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 > bytes) > 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] > with ID 0 > 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) > at > org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} > This causes the job to hang. 
> I can deterministically reproduce this by re-running the test, either in > isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2975) SPARK_LOCAL_DIRS may cause problems when running in local mode
Josh Rosen created SPARK-2975: - Summary: SPARK_LOCAL_DIRS may cause problems when running in local mode Key: SPARK-2975 URL: https://issues.apache.org/jira/browse/SPARK-2975 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Josh Rosen Priority: Minor If we're running Spark in local mode and {{SPARK_LOCAL_DIRS}} is set, the {{Executor}} modifies SparkConf so that this value overrides {{spark.local.dir}}. Normally, this is safe because the modification takes place before SparkEnv is created. In local mode, the Executor uses an existing SparkEnv rather than creating a new one, so it winds up with a DiskBlockManager that created local directories with the original {{spark.local.dir}} setting, but other components attempt to use directories specified in the _new_ {{spark.local.dir}}, leading to problems. I discovered this issue while testing Spark 1.1.0-snapshot1, but I think it will also affect Spark 1.0 (haven't confirmed this, though). (I posted some comments at https://github.com/apache/spark/pull/299#discussion-diff-15975800, but also opening this JIRA so this isn't forgotten.) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
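A speculative way to exercise the mismatch described above from PySpark; the paths are placeholders, the environment variable must be set before the JVM is launched, and this is only a sketch of the conditions, not a confirmed reproduction from the ticket.
{code}
# Speculative reproduction sketch (not from the ticket): mismatch the env var
# and the conf setting in local mode, then force a file fetch via addFile.
import os
os.environ["SPARK_LOCAL_DIRS"] = "/tmp/spark-env-local"       # placeholder path

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("local-dirs-repro") \
                  .set("spark.local.dir", "/tmp/spark-conf-local")  # placeholder path
sc = SparkContext(conf=conf)
sc.addFile("/etc/hosts")                       # any small local file
sc.parallelize(range(10), 2).count()           # tasks fetch the file; may hit the IOException above
{code}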
[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
[ https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2717: --- Priority: Critical (was: Major) > BasicBlockFetchIterator#next should log when it gets stuck > -- > > Key: SPARK-2717 > URL: https://issues.apache.org/jira/browse/SPARK-2717 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Josh Rosen >Priority: Critical > > If this is stuck for a long time waiting for blocks, we should log what nodes > it is waiting for to help debugging. One way to do this is to call take() > with a timeout (e.g. 60 seconds) and when the timeout expires log a message > for the blocks it is still waiting for. This could all happen in a loop so > that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
[ https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2717: --- Priority: Major (was: Blocker) > BasicBlockFetchIterator#next should log when it gets stuck > -- > > Key: SPARK-2717 > URL: https://issues.apache.org/jira/browse/SPARK-2717 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Josh Rosen > > If this is stuck for a long time waiting for blocks, we should log what nodes > it is waiting for to help debugging. One way to do this is to call take() > with a timeout (e.g. 60 seconds) and when the timeout expires log a message > for the blocks it is still waiting for. This could all happen in a loop so > that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
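The suggested approach is essentially a timed poll in a loop. A minimal sketch of that pattern follows (in Python for brevity; the real iterator is Scala, and the function and argument names here are illustrative).
{code}
import logging
import Queue   # Python 2 standard library

def next_result(results, pending_blocks, timeout_secs=60):
    # Poll with a timeout; whenever the wait expires, log which block ids are
    # still outstanding, then go back to waiting (the loop restarts the wait).
    while True:
        try:
            return results.get(timeout=timeout_secs)
        except Queue.Empty:
            logging.warning("Still waiting for %d blocks: %s",
                            len(pending_blocks), sorted(pending_blocks))
{code}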
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093018#comment-14093018 ] Apache Spark commented on SPARK-1297: - User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/1893 > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2974) Utils.getLocalDir() may return non-existent spark.local.dir directory
Josh Rosen created SPARK-2974: - Summary: Utils.getLocalDir() may return non-existent spark.local.dir directory Key: SPARK-2974 URL: https://issues.apache.org/jira/browse/SPARK-2974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Josh Rosen Priority: Blocker The patch for [SPARK-2324] modified Spark to ignore a certain number of invalid local directories. Unfortunately, the {{Utils.getLocalDir()}} method returns the _first_ local directory from {{spark.local.dir}}, which might not exist. This can lead to confusing FileNotFound errors when executors attempt to fetch files. (I commented on this at https://github.com/apache/spark/pull/1274#issuecomment-51537965, but I'm opening a JIRA so we don't forget to fix it). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093012#comment-14093012 ] Ted Yu commented on SPARK-1297: --- https://github.com/apache/spark/pull/1893 > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092988#comment-14092988 ] Ted Yu commented on SPARK-1297: --- The HBase client doesn't need to specify a dependency on hbase-hadoop1-compat or hbase-hadoop2-compat. I can open a PR once there is positive feedback on the approach - I came from a project where reviews mostly happen on JIRA :-) Can someone assign this issue to me? > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2973) Add a way to show tables without executing a job
Aaron Davidson created SPARK-2973: - Summary: Add a way to show tables without executing a job Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, sql("show tables").collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
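As a sketch of the idea with hypothetical classes (not Spark SQL's actual catalog API): metadata commands such as "show tables" can be answered on the driver from the catalog alone, so no distributed job needs to be scheduled.
{code}
import scala.collection.mutable

// Hypothetical sketch: answer "show tables" from driver-side metadata.
// None of these names are real Spark SQL classes.
class SimpleCatalog {
  private val tables = mutable.LinkedHashMap.empty[String, String]
  def register(name: String, comment: String): Unit = tables(name) = comment
  // Computed locally on the driver; no job is launched.
  def showTables(): Seq[String] = tables.keys.toSeq
}

object ShowTablesSketch {
  def main(args: Array[String]): Unit = {
    val catalog = new SimpleCatalog
    catalog.register("people", "demo table")
    catalog.register("logs", "demo table")
    println(catalog.showTables().mkString("\n"))
  }
}
{code}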
[jira] [Created] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
Shay Rojansky created SPARK-2972: Summary: APPLICATION_COMPLETE not created in Python unless context explicitly stopped Key: SPARK-2972 URL: https://issues.apache.org/jira/browse/SPARK-2972 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Environment: Cloudera 5.1, yarn master on ubuntu precise Reporter: Shay Rojansky If you don't explicitly stop a SparkContext at the end of a Python application with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't get picked up by the history server. This can be easily reproduced with pyspark (but affects scripts as well). The current workaround is to wrap the entire script with a try/finally and stop manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
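For reference, the try/finally workaround looks like this in a Scala driver; the ticket itself concerns PySpark, where the same pattern applies with sc.stop() inside a finally block.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Always stop the context, even if the job body throws, so the event log is
// finalized and the application is picked up by the history server.
object StopContextExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stop-example"))
    try {
      val n = sc.parallelize(1 to 100).count()
      println(s"count = $n")
    } finally {
      sc.stop()
    }
  }
}
{code}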
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092967#comment-14092967 ] Sean Owen commented on SPARK-1297: -- I think you may want to open a PR rather than post patches. Code reviews happen on github.com. I see what you did there by triggering one or the other profile with the hbase.profile property. Yeah, that may be the least disruptive way to play this. But don't the profiles need to select the hadoop-compat module appropriate for Hadoop 1 vs Hadoop 2? > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092966#comment-14092966 ] Josh Rosen commented on SPARK-2931: --- Thanks for investigating and reproducing this issue. Is someone planning to open a PR with a fix? If not, I can probably do it later this afternoon, since this bug is a blocker for many of the spark-perf tests that I'm running. > getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException > --- > > Key: SPARK-2931 > URL: https://issues.apache.org/jira/browse/SPARK-2931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf > benchmark >Reporter: Josh Rosen >Priority: Blocker > Fix For: 1.1.0 > > Attachments: scala-sort-by-key.err, test.patch > > > When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, > I get the following errors (one per task): > {code} > 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage > 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 > bytes) > 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] > with ID 0 > 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) > at > org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} > This causes the job to hang. > I can deterministically reproduce this by re-running the test, either in > isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-1297: -- Attachment: spark-1297-v4.txt Patch v4 adds two profiles to examples/pom.xml: hbase-hadoop1 (default) and hbase-hadoop2. I verified that compilation passes with either profile active. > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Minor > Attachments: spark-1297-v2.txt, spark-1297-v4.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is imcomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2963: -- Description: Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's description is incomplete. (was: Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's implicit. I think we need to describe how to build.) > The description about building to use HiveServer and CLI is imcomplete > -- > > Key: SPARK-2963 > URL: https://issues.apache.org/jira/browse/SPARK-2963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta > > Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use > -Phive-thriftserver option when building but it's description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is imcomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2963: -- Summary: The description about building to use HiveServer and CLI is imcomplete (was: There no documentation about building to use HiveServer and CLI for SparkSQL) > The description about building to use HiveServer and CLI is imcomplete > -- > > Key: SPARK-2963 > URL: https://issues.apache.org/jira/browse/SPARK-2963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta > > Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use > -Phive-thriftserver option when building but it's implicit. > I think we need to describe how to build. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2963) The description about building to use HiveServer and CLI is imcomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092894#comment-14092894 ] Kousuke Saruta commented on SPARK-2963: --- I've updated this title and Github's one. > The description about building to use HiveServer and CLI is imcomplete > -- > > Key: SPARK-2963 > URL: https://issues.apache.org/jira/browse/SPARK-2963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta > > Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use > -Phive-thriftserver option when building but it's implicit. > I think we need to describe how to build. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092889#comment-14092889 ] Cheng Lian edited comment on SPARK-2963 at 8/11/14 3:31 PM: Actually [there is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server], but the Spark CLI part is incomplete. Would you mind to update the Issue title and description? Thanks. was (Author: lian cheng): Actually [there is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server] but the Spark CLI part is incomplete. Would you mind to update the Issue title and description? Thanks. > There no documentation about building to use HiveServer and CLI for SparkSQL > > > Key: SPARK-2963 > URL: https://issues.apache.org/jira/browse/SPARK-2963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta > > Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use > -Phive-thriftserver option when building but it's implicit. > I think we need to describe how to build. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092889#comment-14092889 ] Cheng Lian commented on SPARK-2963: --- Actually [there is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server] but the Spark CLI part is incomplete. Would you mind to update the Issue title and description? Thanks. > There no documentation about building to use HiveServer and CLI for SparkSQL > > > Key: SPARK-2963 > URL: https://issues.apache.org/jira/browse/SPARK-2963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta > > Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use > -Phive-thriftserver option when building but it's implicit. > I think we need to describe how to build. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092881#comment-14092881 ] Thomas Graves commented on SPARK-2089: -- Sandy, just wondering if you have any ETA on fix for this? > With YARN, preferredNodeLocalityData isn't honored > --- > > Key: SPARK-2089 > URL: https://issues.apache.org/jira/browse/SPARK-2089 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Critical > > When running in YARN cluster mode, apps can pass preferred locality data when > constructing a Spark context that will dictate where to request executor > containers. > This is currently broken because of a race condition. The Spark-YARN code > runs the user class and waits for it to start up a SparkContext. During its > initialization, the SparkContext will create a YarnClusterScheduler, which > notifies a monitor in the Spark-YARN code that . The Spark-Yarn code then > immediately fetches the preferredNodeLocationData from the SparkContext and > uses it to start requesting containers. > But in the SparkContext constructor that takes the preferredNodeLocationData, > setting preferredNodeLocationData comes after the rest of the initialization, > so, if the Spark-YARN code comes around quickly enough after being notified, > the data that's fetched is the empty unset version. The occurred during all > of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092879#comment-14092879 ] Kousuke Saruta commented on SPARK-2970: --- [~liancheng] Thank you for pointing out my mistake. I've modified the description. > spark-sql script ends with IOException when EventLogging is enabled > --- > > Key: SPARK-2970 > URL: https://issues.apache.org/jira/browse/SPARK-2970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: CDH5.1.0 (Hadoop 2.3.0) >Reporter: Kousuke Saruta > > When spark-sql script run with spark.eventLog.enabled set true, it ends with > IOException because FileLogger can not create APPLICATION_COMPLETE file in > HDFS. > It's is because a shutdown hook of SparkSQLCLIDriver is executed after a > shutdown hook of org.apache.hadoop.fs.FileSystem is executed. > When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally > try to create a file to mark the application finished but the hook of > FileSystem try to close FileSystem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2970: -- Description: When spark-sql script run with spark.eventLog.enabled set true, it ends with IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS. It's is because a shutdown hook of SparkSQLCLIDriver is executed after a shutdown hook of org.apache.hadoop.fs.FileSystem is executed. When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally try to create a file to mark the application finished but the hook of FileSystem try to close FileSystem. was: When spark-sql script run with spark.eventLog.enabled set true, it ends with IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS. I think it's because FIleSystem is closed by HiveSessionImplWithUGI. It has a code as follows. {code} public void close() throws HiveSQLException { try { acquire(); ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi); cancelDelegationToken(); } finally { release(); super.close(); } } {code} When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim which extends HadoopShimSecure. HadoopShimSecure#closeAllForUGI is implemented as follows. {code} @Override public void closeAllForUGI(UserGroupInformation ugi) { try { FileSystem.closeAllForUGI(ugi); } catch (IOException e) { LOG.error("Could not clean up file-system handles for UGI: " + ugi, e); } } {code} > spark-sql script ends with IOException when EventLogging is enabled > --- > > Key: SPARK-2970 > URL: https://issues.apache.org/jira/browse/SPARK-2970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: CDH5.1.0 (Hadoop 2.3.0) >Reporter: Kousuke Saruta > > When spark-sql script run with spark.eventLog.enabled set true, it ends with > IOException because FileLogger can not create APPLICATION_COMPLETE file in > HDFS. > It's is because a shutdown hook of SparkSQLCLIDriver is executed after a > shutdown hook of org.apache.hadoop.fs.FileSystem is executed. > When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally > try to create a file to mark the application finished but the hook of > FileSystem try to close FileSystem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092860#comment-14092860 ] Cheng Lian commented on SPARK-2970: --- [~sarutak] Would you mind to update the issue description? Otherwise it can be confusing for people that don't see your comments below. Thanks. > spark-sql script ends with IOException when EventLogging is enabled > --- > > Key: SPARK-2970 > URL: https://issues.apache.org/jira/browse/SPARK-2970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: CDH5.1.0 (Hadoop 2.3.0) >Reporter: Kousuke Saruta > > When spark-sql script run with spark.eventLog.enabled set true, it ends with > IOException because FileLogger can not create APPLICATION_COMPLETE file in > HDFS. > I think it's because FIleSystem is closed by HiveSessionImplWithUGI. > It has a code as follows. > {code} > public void close() throws HiveSQLException { > try { > acquire(); > ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi); > cancelDelegationToken(); > } finally { > release(); > super.close(); > } > } > {code} > When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim > which extends HadoopShimSecure. > HadoopShimSecure#closeAllForUGI is implemented as follows. > {code} > @Override > public void closeAllForUGI(UserGroupInformation ugi) { > try { > FileSystem.closeAllForUGI(ugi); > } catch (IOException e) { > LOG.error("Could not clean up file-system handles for UGI: " + ugi, e); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough
[ https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092826#comment-14092826 ] Apache Spark commented on SPARK-1777: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/1892 > Pass "cached" blocks directly to disk if memory is not large enough > --- > > Key: SPARK-1777 > URL: https://issues.apache.org/jira/browse/SPARK-1777 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Andrew Or >Priority: Critical > Fix For: 1.1.0 > > Attachments: spark-1777-design-doc.pdf > > > Currently in Spark we entirely unroll a partition and then check whether it > will cause us to exceed the storage limit. This has an obvious problem - if > the partition itself is enough to push us over the storage limit (and > eventually over the JVM heap), it will cause an OOM. > This can happen in cases where a single partition is very large or when > someone is running examples locally with a small heap. > https://github.com/apache/spark/blob/f6ff2a61d00d12481bfb211ae13d6992daacdcc2/core/src/main/scala/org/apache/spark/CacheManager.scala#L148 > We should think a bit about the most elegant way to fix this - it shares some > similarities with the external aggregation code. > A simple idea is to periodically check the size of the buffer as we are > unrolling and see if we are over the memory limit. If we are we could prepend > the existing buffer to the iterator and write that entire thing out to disk. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
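A plain-Scala sketch of the "simple idea" in the last paragraph, not Spark's actual CacheManager or MemoryStore code: unroll into a buffer, check an estimated size periodically, and once over the limit hand back the already-buffered values prepended to the rest of the iterator so the whole partition can be written to disk instead.
{code}
import scala.collection.mutable.ArrayBuffer

object UnrollSketch {
  sealed trait UnrollResult
  case class InMemory(values: Array[String]) extends UnrollResult
  case class SpillToDisk(values: Iterator[String]) extends UnrollResult

  // Unroll an iterator, but bail out to disk once the estimated size is too big.
  def unroll(it: Iterator[String], maxBytes: Long, checkEvery: Int = 16): UnrollResult = {
    val buffer = new ArrayBuffer[String]
    var estimatedBytes = 0L
    var count = 0
    while (it.hasNext) {
      val v = it.next()
      buffer += v
      estimatedBytes += v.length * 2L // crude size estimate for the sketch
      count += 1
      if (count % checkEvery == 0 && estimatedBytes > maxBytes) {
        // Prepend what has been unrolled so far to the remaining iterator.
        return SpillToDisk(buffer.iterator ++ it)
      }
    }
    InMemory(buffer.toArray)
  }
}
{code}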
[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark
[ https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092807#comment-14092807 ] Mridul Muralidharan commented on SPARK-2962: On further investigation: a) The primary issue is a combination of SPARK-2089 and the current scheduler behavior for pendingTasksWithNoPrefs. SPARK-2089 leads to very bad allocation of nodes - it particularly has an impact on bigger clusters. It leaves a lot of blocks with no data-local or rack-local executors, causing their tasks to end up in pendingTasksWithNoPrefs. While loading data off DFS, when an executor is being scheduled, even though there might be rack-local schedules available for it (or, after waiting a while, data-local ones too - see (b) below), the current scheduler behavior schedules tasks from pendingTasksWithNoPrefs first, causing a large number of ANY tasks to be scheduled at the very onset. The combination of these, with the lack of marginal alleviation via (b), is what caused the performance impact. b) spark.scheduler.minRegisteredExecutorsRatio had not yet been used in the workload - so that might alleviate some of the non-deterministic waiting and ensure adequate executors are allocated! Thanks [~lirui] > Suboptimal scheduling in spark > -- > > Key: SPARK-2962 > URL: https://issues.apache.org/jira/browse/SPARK-2962 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: All >Reporter: Mridul Muralidharan > > In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs > are always scheduled with PROCESS_LOCAL > pendingTasksWithNoPrefs contains tasks which currently do not have any alive > locations - but which could come in 'later' : particularly relevant when > spark app is just coming up and containers are still being added. > This causes a large number of non node local tasks to be scheduled incurring > significant network transfers in the cluster when running with non trivial > datasets. > The comment "// Look for no-pref tasks after rack-local tasks since they can > run anywhere." is misleading in the method code : locality levels start from > process_local down to any, and so no prefs get scheduled much before rack. > Also note that, currentLocalityIndex is reset to the taskLocality returned by > this method - so returning PROCESS_LOCAL as the level will trigger wait times > again. (Was relevant before recent change to scheduler, and might be again > based on resolution of this issue). > Found as part of writing test for SPARK-2931 > -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever
Shay Rojansky created SPARK-2971: Summary: Orphaned YARN ApplicationMaster lingers forever Key: SPARK-2971 URL: https://issues.apache.org/jira/browse/SPARK-2971 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise Reporter: Shay Rojansky We have cases where if CTRL-C is hit during a Spark job startup, a YARN ApplicationMaster is created but cannot connect to the driver (presumably because the driver has terminated). Once an AM enters this state it never exits it, and has to be manually killed in YARN. Here's an excerpt from the AM logs: {noformat} SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji 14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roji) 14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started 14/08/11 16:29:40 INFO Remoting: Starting remoting 14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075] 14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075] 14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at master.grid.eaglerd.local/192.168.41.100:8030 14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1407759736957_0014_01 14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster 14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be reachable. 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092710#comment-14092710 ] Apache Spark commented on SPARK-2970: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/1891 > spark-sql script ends with IOException when EventLogging is enabled > --- > > Key: SPARK-2970 > URL: https://issues.apache.org/jira/browse/SPARK-2970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: CDH5.1.0 (Hadoop 2.3.0) >Reporter: Kousuke Saruta > > When spark-sql script run with spark.eventLog.enabled set true, it ends with > IOException because FileLogger can not create APPLICATION_COMPLETE file in > HDFS. > I think it's because FIleSystem is closed by HiveSessionImplWithUGI. > It has a code as follows. > {code} > public void close() throws HiveSQLException { > try { > acquire(); > ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi); > cancelDelegationToken(); > } finally { > release(); > super.close(); > } > } > {code} > When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim > which extends HadoopShimSecure. > HadoopShimSecure#closeAllForUGI is implemented as follows. > {code} > @Override > public void closeAllForUGI(UserGroupInformation ugi) { > try { > FileSystem.closeAllForUGI(ugi); > } catch (IOException e) { > LOG.error("Could not clean up file-system handles for UGI: " + ugi, e); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092705#comment-14092705 ] Kousuke Saruta commented on SPARK-2970: --- I noticed it's not caused by the reason above. It's caused by shutdown hook of FileSystem. I have already resolved it to execute shutdown hook for stopping SparkSQLContext before the shutdown hook for FileSystem. > spark-sql script ends with IOException when EventLogging is enabled > --- > > Key: SPARK-2970 > URL: https://issues.apache.org/jira/browse/SPARK-2970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: CDH5.1.0 (Hadoop 2.3.0) >Reporter: Kousuke Saruta > > When spark-sql script run with spark.eventLog.enabled set true, it ends with > IOException because FileLogger can not create APPLICATION_COMPLETE file in > HDFS. > I think it's because FIleSystem is closed by HiveSessionImplWithUGI. > It has a code as follows. > {code} > public void close() throws HiveSQLException { > try { > acquire(); > ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi); > cancelDelegationToken(); > } finally { > release(); > super.close(); > } > } > {code} > When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim > which extends HadoopShimSecure. > HadoopShimSecure#closeAllForUGI is implemented as follows. > {code} > @Override > public void closeAllForUGI(UserGroupInformation ugi) { > try { > FileSystem.closeAllForUGI(ugi); > } catch (IOException e) { > LOG.error("Could not clean up file-system handles for UGI: " + ugi, e); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
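One way to express that ordering, assuming Hadoop 2.x is on the classpath: register the hook that finalizes the event log through Hadoop's ShutdownHookManager with a priority above FileSystem.SHUTDOWN_HOOK_PRIORITY, so it runs before the FileSystem cache is closed. This sketches the approach described in the comment, not necessarily the exact code in the pull request.
{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

object ShutdownOrderSketch {
  // Register the event-log-finalizing work so it runs before HDFS clients
  // are closed by the FileSystem shutdown hook.
  def registerEventLogHook(finishEventLog: () => Unit): Unit = {
    val priority = FileSystem.SHUTDOWN_HOOK_PRIORITY + 10
    ShutdownHookManager.get().addShutdownHook(new Runnable {
      override def run(): Unit = finishEventLog()
    }, priority)
  }
}
{code}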
[jira] [Created] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
Kousuke Saruta created SPARK-2970: - Summary: spark-sql script ends with IOException when EventLogging is enabled Key: SPARK-2970 URL: https://issues.apache.org/jira/browse/SPARK-2970 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: CDH5.1.0 (Hadoop 2.3.0) Reporter: Kousuke Saruta When spark-sql script run with spark.eventLog.enabled set true, it ends with IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS. I think it's because FIleSystem is closed by HiveSessionImplWithUGI. It has a code as follows. {code} public void close() throws HiveSQLException { try { acquire(); ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi); cancelDelegationToken(); } finally { release(); super.close(); } } {code} When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim which extends HadoopShimSecure. HadoopShimSecure#closeAllForUGI is implemented as follows. {code} @Override public void closeAllForUGI(UserGroupInformation ugi) { try { FileSystem.closeAllForUGI(ugi); } catch (IOException e) { LOG.error("Could not clean up file-system handles for UGI: " + ugi, e); } } {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator
[ https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092677#comment-14092677 ] Apache Spark commented on SPARK-2878: - User 'GrahamDennis' has created a pull request for this issue: https://github.com/apache/spark/pull/1890 > Inconsistent Kryo serialisation with custom Kryo Registrator > > > Key: SPARK-2878 > URL: https://issues.apache.org/jira/browse/SPARK-2878 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.2 > Environment: Linux RedHat EL 6, 4-node Spark cluster. >Reporter: Graham Dennis > > The custom Kryo Registrator (a class with the > org.apache.spark.serializer.KryoRegistrator trait) is not used with every > Kryo instance created, and this causes inconsistent serialisation and > deserialisation. > The Kryo Registrator is sometimes not used because of a ClassNotFound > exception that only occurs if it *isn't* the Worker thread (of an Executor) > that tries to create the KryoRegistrator. > A complete description of the problem and a project reproducing the problem > can be found at https://github.com/GrahamDennis/spark-kryo-serialisation > I have currently only tested this with Spark 1.0.0, but will try to test > against 1.0.2. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator
[ https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092675#comment-14092675 ] Graham Dennis commented on SPARK-2878: -- I've created a pull request with work-in-progress changes that I'd like feedback on: https://github.com/apache/spark/pull/1890 > Inconsistent Kryo serialisation with custom Kryo Registrator > > > Key: SPARK-2878 > URL: https://issues.apache.org/jira/browse/SPARK-2878 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.2 > Environment: Linux RedHat EL 6, 4-node Spark cluster. >Reporter: Graham Dennis > > The custom Kryo Registrator (a class with the > org.apache.spark.serializer.KryoRegistrator trait) is not used with every > Kryo instance created, and this causes inconsistent serialisation and > deserialisation. > The Kryo Registrator is sometimes not used because of a ClassNotFound > exception that only occurs if it *isn't* the Worker thread (of an Executor) > that tries to create the KryoRegistrator. > A complete description of the problem and a project reproducing the problem > can be found at https://github.com/GrahamDennis/spark-kryo-serialisation > I have currently only tested this with Spark 1.0.0, but will try to test > against 1.0.2. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
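A sketch of the underlying classloader issue (not the contents of the pull request): load the registrator class with an explicitly chosen classloader instead of relying on whichever thread happens to create the Kryo instance. The userClassLoader parameter and the Registrator trait are stand-ins, assumptions for illustration only.
{code}
object RegistratorLoadingSketch {
  trait Registrator { def register(): Unit } // stand-in for KryoRegistrator

  // Instantiate a user-supplied registrator with a known classloader, so the
  // lookup does not fail just because the calling thread's context loader
  // cannot see the application's jars.
  def loadRegistrator(className: String, userClassLoader: ClassLoader): Registrator = {
    val clazz = Class.forName(className, true, userClassLoader)
    clazz.getDeclaredConstructor().newInstance().asInstanceOf[Registrator]
  }

  def main(args: Array[String]): Unit = {
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader) // fallback if no context loader is set
    // loadRegistrator("com.example.MyRegistrator", loader) would be called here.
    println(s"registrators would be loaded with: $loader")
  }
}
{code}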
[jira] [Commented] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
[ https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092671#comment-14092671 ] Apache Spark commented on SPARK-2969: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/1889 > Make ScalaReflection be able to handle MapType.containsNull and > MapType.valueContainsNull. > -- > > Key: SPARK-2969 > URL: https://issues.apache.org/jira/browse/SPARK-2969 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Make {{ScalaReflection}} be able to handle like: > - Seq\[Int] as ArrayType(IntegerType, containsNull = false) > - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true) > - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false) > - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, > valueContainsNull = true) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
[ https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-2969: - Description: Make {{ScalaReflection}} be able to handle like: - Seq\[Int] as ArrayType(IntegerType, containsNull = false) - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true) - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false) - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true) was: Make {{ScalaReflection}} be able to handle: - Seq\[Int] as ArrayType(IntegerType, containsNull = false) - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true) - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false) - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true) > Make ScalaReflection be able to handle MapType.containsNull and > MapType.valueContainsNull. > -- > > Key: SPARK-2969 > URL: https://issues.apache.org/jira/browse/SPARK-2969 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Make {{ScalaReflection}} be able to handle like: > - Seq\[Int] as ArrayType(IntegerType, containsNull = false) > - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true) > - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false) > - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, > valueContainsNull = true) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
Takuya Ueshin created SPARK-2969: Summary: Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull. Key: SPARK-2969 URL: https://issues.apache.org/jira/browse/SPARK-2969 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Make {{ScalaReflection}} be able to handle: - Seq\[Int] as ArrayType(IntegerType, containsNull = false) - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true) - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false) - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
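A small illustration of the rule using plain Scala runtime reflection, not Spark's ScalaReflection itself: element and value types that are Scala primitives can never hold null, while boxed java.lang types can.
{code}
import scala.reflect.runtime.universe._

object ContainsNullSketch {
  // A type can hold null only if it is not a value (primitive) type.
  private def canBeNull(t: Type): Boolean = !(t <:< typeOf[AnyVal])

  def describe[T: TypeTag](): Unit = typeOf[T] match {
    case t @ TypeRef(_, _, Seq(elem)) if t <:< typeOf[Seq[_]] =>
      println(s"$t -> ArrayType(..., containsNull = ${canBeNull(elem)})")
    case t @ TypeRef(_, _, Seq(_, value)) if t <:< typeOf[Map[_, _]] =>
      println(s"$t -> MapType(..., valueContainsNull = ${canBeNull(value)})")
    case t =>
      println(s"$t -> not handled in this sketch")
  }

  def main(args: Array[String]): Unit = {
    describe[Seq[Int]]()                 // containsNull = false
    describe[Seq[java.lang.Integer]]()   // containsNull = true
    describe[Map[Int, Long]]()           // valueContainsNull = false
    describe[Map[Int, java.lang.Long]]() // valueContainsNull = true
  }
}
{code}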
[jira] [Commented] (SPARK-2968) Fix nullabilities of Explode.
[ https://issues.apache.org/jira/browse/SPARK-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092645#comment-14092645 ] Apache Spark commented on SPARK-2968: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/1888 > Fix nullabilities of Explode. > - > > Key: SPARK-2968 > URL: https://issues.apache.org/jira/browse/SPARK-2968 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > Output nullabilities of {{Explode}} could be detemined by > {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2968) Fix nullabilities of Explode.
Takuya Ueshin created SPARK-2968: Summary: Fix nullabilities of Explode. Key: SPARK-2968 URL: https://issues.apache.org/jira/browse/SPARK-2968 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Output nullabilities of {{Explode}} could be detemined by {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
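A self-contained illustration of the rule with toy types (not Catalyst classes): the column produced by exploding a collection is nullable exactly when the collection may contain null elements or values.
{code}
object ExplodeNullabilitySketch {
  sealed trait DataType
  case object IntegerType extends DataType
  case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
  case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) extends DataType

  // Nullability of the value column an Explode-like operator would emit.
  def explodedValueNullable(child: DataType): Boolean = child match {
    case ArrayType(_, containsNull)       => containsNull
    case MapType(_, _, valueContainsNull) => valueContainsNull
    case _                                => sys.error("explode only applies to arrays and maps")
  }

  def main(args: Array[String]): Unit = {
    println(explodedValueNullable(ArrayType(IntegerType, containsNull = false)))                // false
    println(explodedValueNullable(MapType(IntegerType, IntegerType, valueContainsNull = true))) // true
  }
}
{code}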
[jira] [Created] (SPARK-2967) Several SQL unit test failed when sort-based shuffle is enabled
Saisai Shao created SPARK-2967: -- Summary: Several SQL unit test failed when sort-based shuffle is enabled Key: SPARK-2967 URL: https://issues.apache.org/jira/browse/SPARK-2967 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Saisai Shao Several SQLQuerySuite unit test failed when sort-based shuffle is enabled. Seems SQL test uses GenericMutableRow which will make ExternalSorter's internal buffer all refered to the same object finally because of object's mutability. Seems row should be copied when feeding into ExternalSorter. The error shows below, though have many failures, I only pasted part of them: {noformat} SQLQuerySuite: - SPARK-2041 column name equals tablename - SPARK-2407 Added Parser of SQL SUBSTR() - index into array - left semi greater than predicate - index into array of arrays - agg *** FAILED *** Results do not match for query: Aggregate ['a], ['a,SUM('b) AS c1#38] UnresolvedRelation None, testData2, None == Analyzed Plan == Aggregate [a#4], [a#4,SUM(CAST(b#5, LongType)) AS c1#38L] SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215) == Physical Plan == Aggregate false, [a#4], [a#4,SUM(PartialSum#40L) AS c1#38L] Exchange (HashPartitioning [a#4], 200) Aggregate true, [a#4], [a#4,SUM(CAST(b#5, LongType)) AS PartialSum#40L] ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215 == Results == !== Correct Answer - 3 == == Spark Answer - 3 == !Vector(1, 3) [1,3] !Vector(2, 3) [1,3] !Vector(3, 3) [1,3] (QueryTest.scala:53) - aggregates with nulls - select * - simple select - sorting *** FAILED *** Results do not match for query: Sort ['a ASC,'b ASC] Project [*] UnresolvedRelation None, testData2, None == Analyzed Plan == Sort [a#4 ASC,b#5 ASC] Project [a#4,b#5] SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215) == Physical Plan == Sort [a#4 ASC,b#5 ASC], true Exchange (RangePartitioning [a#4 ASC,b#5 ASC], 200) ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215 == Results == !== Correct Answer - 6 == == Spark Answer - 6 == !Vector(1, 1) [3,2] !Vector(1, 2) [3,2] !Vector(2, 1) [3,2] !Vector(2, 2) [3,2] !Vector(3, 1) [3,2] !Vector(3, 2) [3,2] (QueryTest.scala:53) - limit - average - average overflow *** FAILED *** Results do not match for query: Aggregate ['b], [AVG('a) AS c0#90,'b] UnresolvedRelation None, largeAndSmallInts, None == Analyzed Plan == Aggregate [b#3], [AVG(CAST(a#2, LongType)) AS c0#90,b#3] SparkLogicalPlan (ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:215) == Physical Plan == Aggregate false, [b#3], [(CAST(SUM(PartialSum#93L), DoubleType) / CAST(SUM(PartialCount#94L), DoubleType)) AS c0#90,b#3] Exchange (HashPartitioning [b#3], 200) Aggregate true, [b#3], [b#3,COUNT(CAST(a#2, LongType)) AS PartialCount#94L,SUM(CAST(a#2, LongType)) AS PartialSum#93L] ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:215 == Results == !== Correct Answer - 2 == == Spark Answer - 2 == !Vector(2.0, 2) [2.147483645E9,1] !Vector(2.147483645E9, 1) [2.147483645E9,1] (QueryTest.scala:53) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
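A toy reproduction of the pitfall, with no Spark classes involved: when a producer reuses a single mutable row object, buffering the reference (as a sorter's internal buffer does) makes every buffered entry reflect the last row, while copying before buffering preserves the values.
{code}
import scala.collection.mutable.ArrayBuffer

object MutableRowPitfall {
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
    override def toString: String = s"Row($value)"
  }

  def main(args: Array[String]): Unit = {
    val shared = new MutableRow(0)
    val rows = Iterator.tabulate(3) { i => shared.value = i; shared }
    val buffered = ArrayBuffer.empty[MutableRow]
    rows.foreach(r => buffered += r)       // buffering the shared reference
    println(buffered.mkString(", "))       // Row(2), Row(2), Row(2)

    val shared2 = new MutableRow(0)
    val rows2 = Iterator.tabulate(3) { i => shared2.value = i; shared2 }
    val copied = ArrayBuffer.empty[MutableRow]
    rows2.foreach(r => copied += r.copy()) // defensive copy before buffering
    println(copied.mkString(", "))         // Row(0), Row(1), Row(2)
  }
}
{code}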
[jira] [Updated] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-2966: --- Summary: Add an approximation algorithm for hierarchical clustering to MLlib (was: Add an approximation algorithm for hierarchical clustering algorithm to MLlib) > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga. et al. proposed highly scalable hierarchical clustering altgorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2966) Add an approximation algorithm for hierarchical clustering algorithm to MLlib
Yu Ishikawa created SPARK-2966: -- Summary: Add an approximation algorithm for hierarchical clustering algorithm to MLlib Key: SPARK-2966 URL: https://issues.apache.org/jira/browse/SPARK-2966 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Priority: Minor A hierarchical clustering algorithm is a useful unsupervised learning method. Koga et al. proposed a highly scalable hierarchical clustering algorithm in (1). I would like to implement this method. I suggest adding an approximate hierarchical clustering algorithm to MLlib. I'd like this to be assigned to me. h3. Reference # Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2862) DoubleRDDFunctions.histogram() throws exception for some inputs
[ https://issues.apache.org/jira/browse/SPARK-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan Kumar updated SPARK-2862: - Description: histogram method call throws an IndexOutOfBoundsException when the choice of bucketCount partitions the RDD in irrational increments e.g. scala> val r = sc.parallelize(6 to 99) r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12 scala> r.histogram(9) java.lang.IndexOutOfBoundsException: 9 at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124) at scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176) at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66) at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237) at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54) at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241) at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249) at scala.collection.AbstractTraversable.toArray(Traversable.scala:105) at org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) was: histogram method call throws the below stack trace when the choice of bucketCount partitions the RDD in irrational increments e.g. scala> val r = sc.parallelize(6 to 99) r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12 scala> r.histogram(9) java.lang.IndexOutOfBoundsException: 9 at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124) at scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176) at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66) at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237) at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54) at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241) at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249) at scala.collection.AbstractTraversable.toArray(Traversable.scala:105) at org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) > DoubleRDDFunctions.histogram() throws exception for some inputs > --- > > Key: SPARK-2862 > URL: https://issues.apache.org/jira/browse/SPARK-2862 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1 > Environment: Scala version 2.9.2 (OpenJDK 64-Bit Server VM, Java > 1.7.0_55) running on Ubuntu 14.04 >Reporter: Chandan Kumar > > histogram method call throws an IndexOutOfBoundsException when the choice of > bucketCount partitions the RDD in irrational increments e.g. 
> scala> val r = sc.parallelize(6 to 99) > r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at > :12 > scala> r.histogram(9) > java.lang.IndexOutOfBoundsException: 9 > at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124) > at > scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66) > at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237) > at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54) > at > scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241) > at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249) > at scala.collection.AbstractTraversable.toArray(Traversable.scala:105) > at > org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116) > at $iwC$$iwC$$iwC$$iwC.(:15) > at $iwC$$iwC$$iwC.(:20) > at $iwC$$iwC.(:22) > at $iwC.(:24) > at (:26) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
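A sketch of a bucket-boundary computation that sidesteps the NumericRange length problem: derive each split point by index so that exactly bucketCount + 1 boundaries are produced regardless of floating-point rounding. This illustrates the failure mode and one possible fix; it is not necessarily the change that ends up in DoubleRDDFunctions.
{code}
object HistogramBucketsSketch {
  // min + i * span / count is computed per index, so the number of boundaries
  // never depends on how the increment accumulates in floating point.
  def buckets(min: Double, max: Double, bucketCount: Int): Array[Double] = {
    require(bucketCount > 0, "bucketCount must be positive")
    val span = max - min
    Array.tabulate(bucketCount + 1) { i =>
      if (i == bucketCount) max else min + (i * span) / bucketCount
    }
  }

  def main(args: Array[String]): Unit = {
    // The failing case from the report: values 6 to 99 split into 9 buckets.
    println(buckets(6.0, 99.0, 9).mkString(", ")) // always 10 boundaries
  }
}
{code}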
[jira] [Commented] (SPARK-2965) Fix HashOuterJoin output nullabilities.
[ https://issues.apache.org/jira/browse/SPARK-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092584#comment-14092584 ] Apache Spark commented on SPARK-2965: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/1887 > Fix HashOuterJoin output nullabilities. > --- > > Key: SPARK-2965 > URL: https://issues.apache.org/jira/browse/SPARK-2965 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > Output attributes of opposite side of {{OuterJoin}} should be nullable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2965) Fix HashOuterJoin output nullabilities.
Takuya Ueshin created SPARK-2965: Summary: Fix HashOuterJoin output nullabilities. Key: SPARK-2965 URL: https://issues.apache.org/jira/browse/SPARK-2965 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Output attributes of opposite side of {{OuterJoin}} should be nullable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
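A self-contained illustration with toy types (not Catalyst code): the side opposite the outer direction can fail to match, so its attributes must be marked nullable in the join output.
{code}
object OuterJoinNullabilitySketch {
  sealed trait JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  case class Attr(name: String, nullable: Boolean) {
    def asNullable: Attr = copy(nullable = true)
  }

  // Output schema of an outer join: the non-preserved side becomes nullable.
  def output(joinType: JoinType, left: Seq[Attr], right: Seq[Attr]): Seq[Attr] =
    joinType match {
      case LeftOuter  => left ++ right.map(_.asNullable)
      case RightOuter => left.map(_.asNullable) ++ right
      case FullOuter  => left.map(_.asNullable) ++ right.map(_.asNullable)
    }

  def main(args: Array[String]): Unit = {
    val left  = Seq(Attr("id", nullable = false))
    val right = Seq(Attr("name", nullable = false))
    println(output(LeftOuter, left, right)) // List(Attr(id,false), Attr(name,true))
  }
}
{code}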
[jira] [Commented] (SPARK-2964) Wrong silent option in spark-sql script
[ https://issues.apache.org/jira/browse/SPARK-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092515#comment-14092515 ] Apache Spark commented on SPARK-2964: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/1886 > Wrong silent option in spark-sql script > --- > > Key: SPARK-2964 > URL: https://issues.apache.org/jira/browse/SPARK-2964 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In spark-sql script, -s option is handled as silent option but > org.apache.hadoop.hive.cli.OptionProcessor interpret -S (large character) as > silent mode option. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org