[jira] [Created] (SPARK-11115) IPv6 regression

2015-10-14 Thread Thomas Dudziak (JIRA)
Thomas Dudziak created SPARK-11115:
--

 Summary: IPv6 regression
 Key: SPARK-11115
 URL: https://issues.apache.org/jira/browse/SPARK-11115
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
 Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
Reporter: Thomas Dudziak
Priority: Critical


When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:

15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
java.lang.AssertionError: assertion failed: Expected hostname
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
        at org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
        at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
        at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
        at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)

Looking at the code in question, it will only work for IPv4, since it assumes 
that ':' cannot be part of the hostname (which it clearly can be for IPv6 
addresses).
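
For context, the failing check in Utils.scala boils down to something like the 
following (paraphrased, so treat it as a sketch rather than the verbatim source):

  // Rejects any host string containing ':', which every IPv6 literal does,
  // port or no port.
  def checkHost(host: String, message: String = "") {
    assert(host.indexOf(':') == -1, message)
  }

An IPv6 literal such as 2001:db8::1 therefore trips the assertion even when no 
port is attached.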
Instead, the code should probably use Guava's HostAndPort class, i.e.:

  import com.google.common.net.HostAndPort

  def checkHost(host: String, message: String = "") {
    // A bare host, including a bracketless IPv6 literal, parses with no port.
    assert(!HostAndPort.fromString(host).hasPort, message)
  }

  def checkHostPort(hostPort: String, message: String = "") {
    // A host:port pair (or [v6]:port) must carry a port.
    assert(HostAndPort.fromString(hostPort).hasPort, message)
  }
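
HostAndPort.fromString should handle both bracketed literals like 
[2001:db8::1]:8080 (hasPort == true) and bracketless ones like 2001:db8::1 
(hasPort == false), so the assertions would hold for IPv4 and IPv6 alike.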






[jira] [Created] (SPARK-10510) Add documentation for how to register a custom Kryo serializer in Spark

2015-09-08 Thread Thomas Dudziak (JIRA)
Thomas Dudziak created SPARK-10510:
--

 Summary: Add documentation for how to register a custom Kryo 
serializer in Spark
 Key: SPARK-10510
 URL: https://issues.apache.org/jira/browse/SPARK-10510
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Thomas Dudziak
Priority: Minor


The documentation explains how to register classes and links to the Kryo 
documentation for writing custom serializers, but it doesn't say how to register 
those serializers with Spark.
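
For reference, a minimal sketch of what such documentation could show (the 
domain class and serializer here are hypothetical):

  import com.esotericsoftware.kryo.{Kryo, Serializer}
  import com.esotericsoftware.kryo.io.{Input, Output}
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class MyClass(val id: Long)

  // Custom Kryo serializer for MyClass.
  class MyClassSerializer extends Serializer[MyClass] {
    override def write(kryo: Kryo, output: Output, obj: MyClass): Unit =
      output.writeLong(obj.id)
    override def read(kryo: Kryo, input: Input, clazz: Class[MyClass]): MyClass =
      new MyClass(input.readLong())
  }

  // Registrator that Spark instantiates on the driver and executors.
  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit =
      kryo.register(classOf[MyClass], new MyClassSerializer)
  }

  // Wire it up via the registrator's fully qualified class name.
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.example.MyRegistrator")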





[jira] [Commented] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2015-08-05 Thread Thomas Dudziak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659510#comment-14659510
 ] 

Thomas Dudziak commented on SPARK-6384:
---

I have seen this with Orc as well (using INSERT INTO TABLE) in 1.4.1.
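
Until the root cause is fixed, one possible mitigation is to prune leftover 
attempt_* directories before pointing Hive at the output. A rough sketch 
(the output path is hypothetical):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Remove any leftover attempt_* directories under the job output folder.
  val out = new Path("/warehouse/my_table")  // hypothetical output path
  val fs = FileSystem.get(new Configuration())
  fs.listStatus(out)
    .filter(s => s.isDirectory && s.getPath.getName.startsWith("attempt_"))
    .foreach(s => fs.delete(s.getPath, true))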

 saveAsParquet doesn't clean up attempt_* folders
 

 Key: SPARK-6384
 URL: https://issues.apache.org/jira/browse/SPARK-6384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Rex Xiong

 After calling SchemaRDD.saveAsParquet, it runs well and generates the *.parquet, 
 _SUCCESS, _common_metadata and _metadata files successfully.
 But sometimes there are attempt_* folders (e.g. 
 attempt_201503170229_0006_r_06_736, 
 attempt_201503170229_0006_r_000404_416) left under the same folder; each 
 contains one parquet file and appears to be a working temp folder.
 This happens even though the _SUCCESS file was created.
 In this situation, Spark SQL (Hive table) throws an exception when loading this 
 parquet folder:
 Error: java.io.FileNotFoundException: Path is not a file: ../attempt_201503170229_0006_r_06_736
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
 at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,code=0)
 I'm not sure whether it's a Spark bug or a Parquet bug.





[jira] [Issue Comment Deleted] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2015-08-05 Thread Thomas Dudziak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Dudziak updated SPARK-6384:
--
Comment: was deleted

(was: I have seen this with Orc as well (using INSERT INTO TABLE) in 1.4.1.)



[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

2015-06-30 Thread Thomas Dudziak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609189#comment-14609189
 ] 

Thomas Dudziak commented on SPARK-5480:
---

We get the same exception when running a somewhat larger LDA (5 million 
documents, 10k topics, 400 partitions) with Spark 1.3.1.

 GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: 
 ---

 Key: SPARK-5480
 URL: https://issues.apache.org/jira/browse/SPARK-5480
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0, 1.3.1
 Environment: Yarn client
Reporter: Stephane Maarek

 Running the following code:
 val subgraph = graph.subgraph(
   vpred = (id, article) => // working predicate
 ).cache()
 println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")
 val prGraph = subgraph.staticPageRank(5).cache
 val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
   (v, title, rank) => (rank.getOrElse(0.0), title)
 }
 titleAndPrGraph.vertices.top(13) {
   Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
 }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))
 Returns a graph with 5000 nodes and 4000 edges.
 Then it crashes during the PageRank with the following:
 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
 at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
 at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
 at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
 at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)

[jira] [Created] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-05-26 Thread Thomas Dudziak (JIRA)
Thomas Dudziak created SPARK-7874:
-

 Summary: Add a global setting for the fine-grained mesos scheduler 
that limits the number of concurrent tasks of a job
 Key: SPARK-7874
 URL: https://issues.apache.org/jira/browse/SPARK-7874
 Project: Spark
  Issue Type: Wish
  Components: Mesos
Affects Versions: 1.3.1
Reporter: Thomas Dudziak
Priority: Minor


This would be a very simple yet effective way to prevent a job from dominating 
the cluster. A way to override it per job would also be nice, but is not required.
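
As a sketch, the wish is for something along these lines (the property name is 
hypothetical; no such setting exists today):

  import org.apache.spark.SparkConf

  // Hypothetical global cap on the concurrent tasks of a job under the
  // fine-grained Mesos scheduler.
  val conf = new SparkConf()
    .set("spark.mesos.fineGrained.maxConcurrentTasks", "256")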





[jira] [Created] (SPARK-7875) Exception when using CLUSTER BY or ORDER BY

2015-05-26 Thread Thomas Dudziak (JIRA)
Thomas Dudziak created SPARK-7875:
-

 Summary: Exception when using CLUSTER BY or ORDER BY
 Key: SPARK-7875
 URL: https://issues.apache.org/jira/browse/SPARK-7875
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Mesos scheduler with fine-grained mode
Reporter: Thomas Dudziak


Under certain circumstances that I haven't yet been able to isolate, I get the 
following error when running an HQL query via HiveContext:

org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
        at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:56)
        at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
        at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
        at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:647)
        at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:647)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
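
For reference, the failing queries are plain HQL of roughly this shape (the 
table and column names are hypothetical; this is not a confirmed minimal repro):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
  // The trace above fails inside RangePartitioner's reservoir sampling,
  // which queries using ORDER BY or CLUSTER BY can apparently exercise.
  hiveContext.sql("SELECT key, value FROM some_table CLUSTER BY key").collect()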



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org