[jira] [Created] (SPARK-11115) IPv6 regression
Thomas Dudziak created SPARK-11115:
-----------------------------------

             Summary: IPv6 regression
                 Key: SPARK-11115
                 URL: https://issues.apache.org/jira/browse/SPARK-11115
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.1
         Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
            Reporter: Thomas Dudziak
            Priority: Critical

When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:

15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
java.lang.AssertionError: assertion failed: Expected hostname
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
	at org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
	at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
	at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)

Looking at the code in question, it will only work for IPv4, since it assumes ':' can't be part of the hostname (which it clearly can be in an IPv6 address). Instead, the code should probably use Guava's HostAndPort class, i.e.:

def checkHost(host: String, message: String = "") {
  assert(!HostAndPort.fromString(host).hasPort, message)
}

def checkHostPort(hostPort: String, message: String = "") {
  assert(HostAndPort.fromString(hostPort).hasPort, message)
}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
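To illustrate the kind of bracket-aware parsing that Guava's HostAndPort does, here is a minimal, self-contained sketch in plain Scala. This is not Spark's actual Utils.checkHost, and the HostCheck name is hypothetical; it only shows that an explicit port can be detected without assuming ':' never appears in a host.

```scala
// Sketch of an IPv6-safe host/port check, assuming the RFC 3986
// bracketed-literal convention for IPv6 addresses ("[::1]:8080").
// HostCheck is a hypothetical stand-in, not Spark's implementation.
object HostCheck {
  // Returns true when the string carries an explicit port.
  def hasPort(hostPort: String): Boolean = {
    if (hostPort.startsWith("[")) {
      // IPv6 literal: a port can only appear after the closing bracket.
      val close = hostPort.indexOf(']')
      close >= 0 && hostPort.lastIndexOf(':') > close
    } else {
      // Hostname or IPv4: exactly one ':' means host:port; two or more
      // mean a bare IPv6 literal, which cannot carry a port unbracketed.
      hostPort.count(_ == ':') == 1
    }
  }

  def checkHost(host: String): Unit =
    assert(!hasPort(host), s"Expected hostname, got $host")

  def checkHostPort(hostPort: String): Unit =
    assert(hasPort(hostPort), s"Expected host:port, got $hostPort")
}
```

Under these assumptions, a bare IPv6 literal like `::1` passes checkHost even though it contains colons, which is exactly the case the 1.5.1 assertion rejects.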
[jira] [Created] (SPARK-10510) Add documentation for how to register a custom Kryo serializer in Spark
Thomas Dudziak created SPARK-10510:
-----------------------------------

             Summary: Add documentation for how to register a custom Kryo serializer in Spark
                 Key: SPARK-10510
                 URL: https://issues.apache.org/jira/browse/SPARK-10510
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.4.1
            Reporter: Thomas Dudziak
            Priority: Minor

The documentation states how to register classes and links to the Kryo documentation for writing custom serializers, but it doesn't say how to register those custom serializers with Spark.
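For reference, one way to wire a hand-written serializer into Spark is via a KryoRegistrator named in the spark.kryo.registrator setting. This is a sketch, not the missing documentation; MyClass, MyClassSerializer, and com.example.MyRegistrator are hypothetical names.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical: MyClass is the user type, MyClassSerializer its custom
// com.esotericsoftware.kryo.Serializer implementation.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Pair the class with its hand-written serializer instance.
    kryo.register(classOf[MyClass], new MyClassSerializer)
  }
}

// Point Spark at Kryo and at the registrator by fully qualified name.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")
```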
[jira] [Commented] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders
[ https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659510#comment-14659510 ]

Thomas Dudziak commented on SPARK-6384:
---------------------------------------

I have seen this with ORC as well (using INSERT INTO TABLE) in 1.4.1.

saveAsParquet doesn't clean up attempt_* folders

                 Key: SPARK-6384
                 URL: https://issues.apache.org/jira/browse/SPARK-6384
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1
            Reporter: Rex Xiong

After calling SchemaRDD.saveAsParquet, it runs well and successfully generates the *.parquet, _SUCCESS, _common_metadata and _metadata files. But sometimes there are attempt_* folders (e.g. attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) under the same folder; each contains one parquet file and appears to be a working temp folder. This happens even though the _SUCCESS file was created. In this situation, Spark SQL (Hive table) throws an exception when loading this parquet folder:

Error: java.io.FileNotFoundException: Path is not a file: ../attempt_201503170229_0006_r_06_736
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
(state=,code=0)

I'm not sure whether it's a Spark bug or a Parquet bug.
[jira] [Issue Comment Deleted] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders
[ https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Dudziak updated SPARK-6384:
----------------------------------
    Comment: was deleted

(was: I have seen this with ORC as well (using INSERT INTO TABLE) in 1.4.1.)

saveAsParquet doesn't clean up attempt_* folders

                 Key: SPARK-6384
                 URL: https://issues.apache.org/jira/browse/SPARK-6384
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1
            Reporter: Rex Xiong
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609189#comment-14609189 ]

Thomas Dudziak commented on SPARK-5480:
---------------------------------------

We get the same exception when running a somewhat larger LDA (5 million documents, 10k topics, 400 partitions) with Spark 1.3.1.

GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

                 Key: SPARK-5480
                 URL: https://issues.apache.org/jira/browse/SPARK-5480
             Project: Spark
          Issue Type: Bug
          Components: GraphX
    Affects Versions: 1.2.0, 1.3.1
         Environment: Yarn client
            Reporter: Stephane Maarek

Running the following code:

val subgraph = graph.subgraph(
  vpred = (id, article) => // working predicate
).cache()
println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")
val prGraph = subgraph.staticPageRank(5).cache
val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
  (v, title, rank) => (rank.getOrElse(0.0), title)
}
titleAndPrGraph.vertices.top(13) {
  Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
}.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id: " + t._1))

returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following:

15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
	at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
	at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
	at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
	at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at
[jira] [Created] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
Thomas Dudziak created SPARK-7874:
----------------------------------

             Summary: Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
                 Key: SPARK-7874
                 URL: https://issues.apache.org/jira/browse/SPARK-7874
             Project: Spark
          Issue Type: Wish
          Components: Mesos
    Affects Versions: 1.3.1
            Reporter: Thomas Dudziak
            Priority: Minor

This would be a very simple yet effective way to prevent a job from dominating the cluster. A way to override it per job would also be nice, but is not required.
[jira] [Created] (SPARK-7875) Exception when using CLUSTER BY or ORDER BY
Thomas Dudziak created SPARK-7875:
----------------------------------

             Summary: Exception when using CLUSTER BY or ORDER BY
                 Key: SPARK-7875
                 URL: https://issues.apache.org/jira/browse/SPARK-7875
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.1
         Environment: Mesos scheduler with fine-grained mode
            Reporter: Thomas Dudziak

Under certain circumstances that I haven't yet been able to isolate, I get the following error when running an HQL query using HiveContext:

org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
	at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:56)
	at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
	at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
	at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:647)
	at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:647)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
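The "Can only zip RDDs" assertion in the trace above comes from RDD.zip, which pairs the two RDDs partition by partition and requires every pair of partition iterators to be exhausted together. A minimal plain-Scala sketch of that invariant (not Spark's actual RDD.scala code; zipSame is a hypothetical name):

```scala
// Sketch of the per-partition invariant behind RDD.zip: both iterators
// must run out at the same time, otherwise zipping is rejected.
def zipSame[A, B](left: Iterator[A], right: Iterator[B]): Iterator[(A, B)] =
  new Iterator[(A, B)] {
    def hasNext: Boolean = (left.hasNext, right.hasNext) match {
      case (true, true)   => true
      case (false, false) => false
      case _ => throw new IllegalArgumentException(
        "Can only zip iterators with the same number of elements")
    }
    def next(): (A, B) = (left.next(), right.next())
  }
```

Under this reading, the bug report suggests that somewhere in the CLUSTER BY / ORDER BY path two RDDs with mismatched partition contents end up zipped together.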