Re: filtering out non English tweets using TwitterUtils
Fwiw, if you do decide to handle language detection on your machine, this library works great on tweets: https://github.com/carrotsearch/langid-java

On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:
But getLang() is one of the methods of twitter4j.Status since version 3.0.6, according to the doc at: http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
What version of twitter4j does Spark Streaming use?
3.0.3: https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53
Tobias
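For reference, a rough sketch of the getLang() route in Spark Streaming. This is only a sketch: it assumes twitter4j 3.0.6+ has been forced onto the classpath (since, as noted above, the Spark twitter module pulls 3.0.3) and that ssc is an existing StreamingContext.

    import org.apache.spark.streaming.twitter.TwitterUtils

    // Keep only tweets that Twitter itself tagged as English.
    // twitter4j.Status.getLang() exists only in twitter4j 3.0.6+, so the bundled
    // 3.0.3 dependency must be overridden for this to compile.
    val tweets = TwitterUtils.createStream(ssc, None)
    val englishTweets = tweets.filter(status => status.getLang == "en")

If overriding twitter4j is not an option, the same filter can be built around a client-side detector such as the langid-java library mentioned above.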
Re: Java IO Stream Corrupted - Invalid Type AC?
Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to yours: https://imgur.com/7iBA4nJ ).

14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed 4 times; aborting job
14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at VertexRDD.scala:100
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.599:39 failed 4 times, most recent failure: Exception failure in TID 29735 on host node18: java.io.StreamCorruptedException: invalid type code: AC
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
        org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
        org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
        org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
        org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
        org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
        org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/05 20:02:28 INFO scheduler.TaskSchedulerImpl: Cancelling stage 5

On Wed, Jun 4, 2014 at 7:50 AM, Sean Owen so...@cloudera.com wrote:
On Wed, Jun 4, 2014 at 3:33 PM, Matt Kielo mki...@oculusinfo.com wrote:
I'm trying to run some spark code on a cluster but I keep running into a "java.io.StreamCorruptedException: invalid type code: AC" error. My task involves analyzing ~50GB of data (some operations involve
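For context, the failing job described above was essentially the stock edge-list-to-connected-components pipeline. A minimal sketch of what was being run (the path is illustrative, not the one from this report):

    import org.apache.spark.graphx.GraphLoader

    // GraphLoader.edgeListFile parses whitespace-separated "srcId dstId" lines into a Graph.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edge_list")
    // Each vertex ends up labeled with the smallest vertex id in its connected component.
    val components = graph.connectedComponents().vertices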
Re: Spark 1.0: slf4j version conflicts with pig
Remark: just including the jar built by sbt will produce the same error, i.e. this pig script will fail:

REGISTER /usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
edgeList0 = LOAD '/user/rfcompton/twitter-mention-networks/bidirectional-network-current/part-r-1' USING PigStorage() AS (id1:long, id2:long, weight:int);
ttt = LIMIT edgeList0 10;
DUMP ttt;

On Wed, May 28, 2014 at 12:55 PM, Ryan Compton compton.r...@gmail.com wrote:
It appears to be Spark 1.0 related. I made a pom.xml with a single dependency on Spark; registering the resulting jar created the error. Spark 1.0 was compiled via

$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly

The pom.xml, as well as some other information, is below. The only thing that should not be standard is the inclusion of my in-house repository (it's where I host the spark jar I compiled above).

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>my-app</name>
  <url>http://maven.apache.org</url>
  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.10.4</scala.version>
  </properties>
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.1.5</version>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>2.0.2</version>
        </plugin>
      </plugins>
    </pluginManagement>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
              <goal>add-source</goal>
              <goal>compile</goal>
            </goals>
          </execution>
          <execution>
            <id>scala-test-compile</id>
            <phase>process-test-resources</phase>
            <goals>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <!-- Plugin to create a single jar that includes all dependencies -->
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <repositories>
    <!-- needed for cdh build of Spark -->
    <repository>
      <id>releases</id>
      <url>10.10.1.29:8081/nexus/content/repositories/releases</url>
    </repository>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!-- on node29 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly</artifactId>
      <version>1.0.0-cdh3u4</version>
      <classifier>cdh3u4</classifier>
    </dependency>
    <!-- spark docs says I need hadoop-client, cdh3u3 repo no longer exists -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>0.20.2-cdh3u4</version>
    </dependency>
  </dependencies>
</project>

Here's what I get in the dependency tree:

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ my-app ---
[INFO] com.mycompany.app:my-app:jar:1.0
Re: Spark 1.0: slf4j version conflicts with pig
posted a JIRA: https://issues.apache.org/jira/browse/SPARK-1952

On Wed, May 28, 2014 at 1:14 PM, Ryan Compton compton.r...@gmail.com wrote:
Remark: just including the jar built by sbt will produce the same error, i.e. this pig script will fail:

REGISTER /usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
edgeList0 = LOAD '/user/rfcompton/twitter-mention-networks/bidirectional-network-current/part-r-1' USING PigStorage() AS (id1:long, id2:long, weight:int);
ttt = LIMIT edgeList0 10;
DUMP ttt;
Spark 1.0: slf4j version conflicts with pig
I use both Pig and Spark. All my code is built with Maven into a giant *-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now all my pig scripts fail with:

Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
        at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)

Did Spark 1.0 change the version of slf4j? I can't seem to find it via mvn dependency:tree
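One observation (not from the original thread): if Spark 1.0 is consumed as a pre-built assembly jar, its slf4j classes are baked into that single artifact, so mvn dependency:tree will never list them. Two quick checks that may help narrow down where the conflicting slf4j comes from (the jar name is illustrative, taken from the build mentioned later in this thread):

    mvn dependency:tree -Dincludes=org.slf4j
    # lists slf4j only if it arrives as a normal Maven dependency

    jar tf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep org/slf4j
    # lists any slf4j classes bundled inside the assembly jar itself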
Re: GraphX: Help understanding the limitations of Pregel
Whoops, I should have mentioned that it's a multivariate median (cf. http://www.pnas.org/content/97/4/1423.full.pdf ). It's easy to compute when all the values are accessible at once. I'm not sure it's possible with a combiner. So, I guess the question should be: can I use GraphX's Pregel without a combiner?

On Wed, Apr 23, 2014 at 7:01 PM, Tom Vacek minnesota...@gmail.com wrote:
Here are some out-of-the-box ideas: If the elements lie in a fairly small range and/or you're willing to work with limited precision, you could use counting sort. Moreover, you could iteratively find the median using bisection, which would be associative and commutative. It's easy to think of improvements that would make this approach give a reasonable answer in a few iterations. I have no idea about mixing algorithmic iterations with median-finding iterations.

On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton compton.r...@gmail.com wrote:
I'm trying to shoehorn a label-propagation-ish algorithm into GraphX. I need to update each vertex with the median value of its neighbors. Unlike PageRank, which updates each vertex with the mean of its neighbors, I don't have a simple commutative and associative function to use for mergeMsg. What are my options? It looks like I can choose between:
1. a hacky mergeMsg (i.e. combine (a, b) => Array(a, b) and then do the median in vprog)
2. collectNeighbors and then median
3. ignore GraphX and just do the whole thing with joins (which I actually got working, but it's slow)
Is there another possibility that I'm missing?
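For what it's worth, option 1 from the quoted message can be sketched roughly as below. This is only a sketch, not something from the thread: it assumes the vertex attribute is a plain Double held in a Graph[Double, _] named graph, and that the multivariate case would just swap the message payload and the median helper.

    import org.apache.spark.graphx._

    // Upper median of the collected neighbor values; for the multivariate case this
    // would be replaced by an L1-median over an array of vectors instead.
    def median(xs: Array[Double]): Double = xs.sorted.apply(xs.length / 2)

    // graph: Graph[Double, _] is assumed to already hold the value being propagated.
    val result = Pregel(graph, Array.empty[Double], maxIterations = 10)(
      // vprog: every neighbor value arrives in one array, so the median is taken here
      (id, attr, msg) => if (msg.isEmpty) attr else median(msg),
      // sendMsg: each edge forwards the source's current value to the destination
      triplet => Iterator((triplet.dstId, Array(triplet.srcAttr))),
      // mergeMsg: no real combiner, just concatenate so vprog sees every value
      (a, b) => a ++ b
    )

The obvious cost is that high-degree vertices materialize every neighbor value in a single array, which is exactly the memory pressure a real combiner would avoid.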
GraphX: .edges.distinct().count() is 10?
I am trying to read an edge list into a Graph. My data looks like

394365859 -- 136153151
589404147 -- 1361045425

I read it into a Graph via:

val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName)
val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
  .map(x => (x(0).toLong, x(2).toLong))
val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
  uniqueEdges = Option(CanonicalRandomVertexCut))

Now, edgeTupRDD.distinct().count() tells me I have 240086 distinct lines in the file, g.numEdges tells me they combined into 240096 weighted edges (which is really weird since that's more lines than in the RDD), but g.edges.distinct().count() tells me I have 10. Why?
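One way to narrow this down (a diagnostic suggestion, not something from the thread): compare distinct counts over the raw id pairs, over the graph's edges projected back to id pairs, and over the Edge objects themselves. If only the last number collapses to 10, the oddity is in how distinct() treats Edge values rather than in the data:

    println(edgeTupRDD.distinct().count())                            // distinct raw (src, dst) pairs
    println(g.edges.map(e => (e.srcId, e.dstId)).distinct().count())  // distinct id pairs after partitioning
    println(g.edges.distinct().count())                               // distinct Edge objects (the value reported as 10)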
Re: GraphX: .edges.distinct().count() is 10?
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt

This code:

println(g.numEdges)
println(g.numVertices)
println(g.edges.distinct().count())

gave me 1 9294 2

On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
I wasn't able to reproduce this with a small test file, but I did change the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take the third column rather than the second? If so, would you mind posting a larger sample of the file, or even the whole file if possible? Here's the test that succeeded:

test("graph.edges.distinct.count") {
  withSpark { sc =>
    val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
      "394365859\t136153151", "589404147\t1361045425"))
    val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
      .map(x => (x(0).toLong, x(1).toLong))
    val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
      uniqueEdges = Option(CanonicalRandomVertexCut))
    assert(edgeTupRDD.distinct.count() === 2)
    assert(g.numEdges === 2)
    assert(g.edges.distinct.count() === 2)
  }
}

Ankur
Re: distinct on huge dataset
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with:

14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many open files)

ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a worker node during a failed .distinct(): http://i.imgur.com/wyBHmzz.png , you can see tasks fail when Spark tries to open 32000 files. I never ran into this in 0.7.3. Is there a parameter I can set to tell Spark to use less than 32000 files?

On Mon, Mar 24, 2014 at 10:23 AM, Aaron Davidson ilike...@gmail.com wrote:
Look up setting ulimit, though note the distinction between soft and hard limits, and that updating your hard limit may require changing /etc/security/limits.conf and restarting each worker.

On Mon, Mar 24, 2014 at 1:39 AM, Kane kane.ist...@gmail.com wrote:
Got a bit further, I think the out of memory error was caused by setting spark.spill to false. Now I have this error; is there an easy way to increase the file limit for Spark, cluster-wide?

java.io.FileNotFoundException: /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2 (Too many open files)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
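One knob worth knowing about (a suggestion, not drawn from this thread): distinct() is implemented as a shuffle, and the hash shuffle writes roughly one file per map task per reduce partition, so shrinking the number of output partitions shrinks the number of files each worker holds open. A sketch, where rows stands for whatever RDD is being deduplicated and the partition count is arbitrary:

    // distinct(n) controls the number of reduce-side partitions for the dedup shuffle;
    // fewer partitions means fewer simultaneously open shuffle files per worker.
    val deduped = rows.distinct(2000)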
Re: distinct on huge dataset
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...)

On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton compton.r...@gmail.com wrote:
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with:
14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many open files)
ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a worker node during a failed .distinct(): http://i.imgur.com/wyBHmzz.png , you can see tasks fail when Spark tries to open 32000 files. I never ran into this in 0.7.3. Is there a parameter I can set to tell Spark to use less than 32000 files?
All pairs shortest paths?
No idea how feasible this is. Has anyone done it?
Re: All pairs shortest paths?
To clarify: I don't need the actual paths, just the distances. On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton compton.r...@gmail.com wrote: No idea how feasible this is. Has anyone done it?
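If only the distance matrix is needed, one hedged option (it relies on org.apache.spark.graphx.lib.ShortestPaths, which appeared in later GraphX releases and is not mentioned in this thread; graph stands for whatever Graph the distances are wanted on) is landmark BFS with every vertex declared a landmark. Each vertex then carries a map of size |V|, so this is only realistic for small graphs:

    import org.apache.spark.graphx.lib.ShortestPaths

    // Declaring every vertex a landmark turns landmark BFS into all-pairs hop distances.
    val landmarks = graph.vertices.map(_._1).collect().toSeq
    val distances = ShortestPaths.run(graph, landmarks).vertices
    // distances: VertexRDD[Map[VertexId, Int]] mapping each landmark to its hop distance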