Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
FWIW, if you do decide to handle language detection on your own machine, this
library works great on tweets: https://github.com/carrotsearch/langid-java

On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:

 But getLang() is one of the methods of twitter4j.Status since version
 3.0.6
 according to the doc at:
http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--

 What version of twitter4j does Spark Streaming use?


 3.0.3
 https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53

 Tobias
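
For reference, once a twitter4j version that has Status.getLang() is on the
classpath, the filter itself is simple. A minimal sketch, assuming twitter4j >= 3.0.6
and OAuth credentials supplied through the usual twitter4j system properties
(the object name and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object EnglishTweets {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("english-tweets")
    val ssc = new StreamingContext(conf, Seconds(10))
    // TwitterUtils.createStream yields a DStream[twitter4j.Status]
    val tweets = TwitterUtils.createStream(ssc, None)
    // keep only tweets that Twitter itself tagged as English
    val english = tweets.filter(_.getLang == "en")
    english.map(_.getText).print()
    ssc.start()
    ssc.awaitTermination()
  }
}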




Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-06 Thread Ryan Compton
Just ran into this today myself. I'm on branch-1.0 using a CDH3
cluster (no modifications to Spark or its dependencies). The error
appeared while running GraphX's .connectedComponents() on a ~200 GB
edge list (GraphX worked beautifully on smaller data).
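
For reference, the failing job was essentially of this shape; a minimal sketch
(the file path, column layout, and names are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// Build a graph from a tab-separated edge list and run connected components.
def components(sc: SparkContext, path: String) = {
  val edges = sc.textFile(path)
    .map(_.split("\t"))
    .map(t => (t(0).toLong, t(1).toLong))
  val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
  graph.connectedComponents().vertices // (vertexId, componentId) pairs
}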

Here's the stacktrace (it's quite similar to yours https://imgur.com/7iBA4nJ ).

14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
4 times; aborting job
14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
VertexRDD.scala:100
Exception in thread main org.apache.spark.SparkException: Job
aborted due to stage failure: Task 5.599:39 failed 4 times, most
recent failure: Exception failure in TID 29735 on host node18:
java.io.StreamCorruptedException: invalid type code: AC
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)

org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)

org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)

org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/05 20:02:28 INFO scheduler.TaskSchedulerImpl: Cancelling stage 5

On Wed, Jun 4, 2014 at 7:50 AM, Sean Owen so...@cloudera.com wrote:
 On Wed, Jun 4, 2014 at 3:33 PM, Matt Kielo mki...@oculusinfo.com wrote:
 Im trying run some spark code on a cluster but I keep running into a
 java.io.StreamCorruptedException: invalid type code: AC error. My task
 involves analyzing ~50GB of data (some operations involve 

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
Note: just including the jar built by sbt will produce the same
error, i.e. this Pig script will fail:

REGISTER 
/usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;

edgeList0 = LOAD
'/user/rfcompton/twitter-mention-networks/bidirectional-network-current/part-r-1'
USING PigStorage() AS (id1:long, id2:long, weight:int);
ttt = LIMIT edgeList0 10;
DUMP ttt;

On Wed, May 28, 2014 at 12:55 PM, Ryan Compton compton.r...@gmail.com wrote:
 It appears to be Spark 1.0 related. I made a pom.xml with a single
 dependency on Spark; registering the resulting jar reproduced the error.

 Spark 1.0 was compiled via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly

 The pom.xml, as well as some other information, is below. The only
 thing that should not be standard is the inclusion of my in-house
 repository (it's where I host the spark jar I compiled above).

 <project xmlns="http://maven.apache.org/POM/4.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                              http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>com.mycompany.app</groupId>
   <artifactId>my-app</artifactId>
   <version>1.0-SNAPSHOT</version>
   <packaging>jar</packaging>

   <name>my-app</name>
   <url>http://maven.apache.org</url>

   <properties>
     <maven.compiler.source>1.6</maven.compiler.source>
     <maven.compiler.target>1.6</maven.compiler.target>
     <encoding>UTF-8</encoding>
     <scala.version>2.10.4</scala.version>
   </properties>

   <build>
     <pluginManagement>
       <plugins>
         <plugin>
           <groupId>net.alchim31.maven</groupId>
           <artifactId>scala-maven-plugin</artifactId>
           <version>3.1.5</version>
         </plugin>
         <plugin>
           <groupId>org.apache.maven.plugins</groupId>
           <artifactId>maven-compiler-plugin</artifactId>
           <version>2.0.2</version>
         </plugin>
       </plugins>
     </pluginManagement>

     <plugins>

       <plugin>
         <groupId>net.alchim31.maven</groupId>
         <artifactId>scala-maven-plugin</artifactId>
         <executions>
           <execution>
             <id>scala-compile-first</id>
             <phase>process-resources</phase>
             <goals>
               <goal>add-source</goal>
               <goal>compile</goal>
             </goals>
           </execution>
           <execution>
             <id>scala-test-compile</id>
             <phase>process-test-resources</phase>
             <goals>
               <goal>testCompile</goal>
             </goals>
           </execution>
         </executions>
       </plugin>

       <!-- Plugin to create a single jar that includes all dependencies -->
       <plugin>
         <artifactId>maven-assembly-plugin</artifactId>
         <version>2.4</version>
         <configuration>
           <descriptorRefs>
             <descriptorRef>jar-with-dependencies</descriptorRef>
           </descriptorRefs>
         </configuration>
         <executions>
           <execution>
             <id>make-assembly</id>
             <phase>package</phase>
             <goals>
               <goal>single</goal>
             </goals>
           </execution>
         </executions>
       </plugin>

     </plugins>
   </build>

   <repositories>

     <!-- needed for cdh build of Spark -->
     <repository>
       <id>releases</id>
       <url>10.10.1.29:8081/nexus/content/repositories/releases</url>
     </repository>

     <repository>
       <id>cloudera</id>
       <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
     </repository>

   </repositories>

   <dependencies>

     <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-library</artifactId>
       <version>${scala.version}</version>
     </dependency>

     <!-- on node29 -->
     <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-assembly</artifactId>
       <version>1.0.0-cdh3u4</version>
       <classifier>cdh3u4</classifier>
     </dependency>

     <!-- spark docs says I need hadoop-client, cdh3u3 repo no longer exists -->
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>0.20.2-cdh3u4</version>
     </dependency>

   </dependencies>
 </project>


 Here's what I get in the dependency tree:

 [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ my-app ---
 [INFO] com.mycompany.app:my-app:jar:1.0

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
posted a JIRA https://issues.apache.org/jira/browse/SPARK-1952

On Wed, May 28, 2014 at 1:14 PM, Ryan Compton compton.r...@gmail.com wrote:
 Note: just including the jar built by sbt will produce the same
 error, i.e. this Pig script will fail:

 REGISTER 
 /usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;

 edgeList0 = LOAD
 '/user/rfcompton/twitter-mention-networks/bidirectional-network-current/part-r-1'
 USING PigStorage() AS (id1:long, id2:long, weight:int);
 ttt = LIMIT edgeList0 10;
 DUMP ttt;

 On Wed, May 28, 2014 at 12:55 PM, Ryan Compton compton.r...@gmail.com wrote:
 It appears to be Spark 1.0 related. I made a pom.xml with a single
 dependency on Spark; registering the resulting jar reproduced the error.

 Spark 1.0 was compiled via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly

 The pom.xml, as well as some other information, is below. The only
 thing that should not be standard is the inclusion of my in-house
 repository (it's where I host the spark jar I compiled above).

 <project xmlns="http://maven.apache.org/POM/4.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                              http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>com.mycompany.app</groupId>
   <artifactId>my-app</artifactId>
   <version>1.0-SNAPSHOT</version>
   <packaging>jar</packaging>

   <name>my-app</name>
   <url>http://maven.apache.org</url>

   <properties>
     <maven.compiler.source>1.6</maven.compiler.source>
     <maven.compiler.target>1.6</maven.compiler.target>
     <encoding>UTF-8</encoding>
     <scala.version>2.10.4</scala.version>
   </properties>

   <build>
     <pluginManagement>
       <plugins>
         <plugin>
           <groupId>net.alchim31.maven</groupId>
           <artifactId>scala-maven-plugin</artifactId>
           <version>3.1.5</version>
         </plugin>
         <plugin>
           <groupId>org.apache.maven.plugins</groupId>
           <artifactId>maven-compiler-plugin</artifactId>
           <version>2.0.2</version>
         </plugin>
       </plugins>
     </pluginManagement>

     <plugins>

       <plugin>
         <groupId>net.alchim31.maven</groupId>
         <artifactId>scala-maven-plugin</artifactId>
         <executions>
           <execution>
             <id>scala-compile-first</id>
             <phase>process-resources</phase>
             <goals>
               <goal>add-source</goal>
               <goal>compile</goal>
             </goals>
           </execution>
           <execution>
             <id>scala-test-compile</id>
             <phase>process-test-resources</phase>
             <goals>
               <goal>testCompile</goal>
             </goals>
           </execution>
         </executions>
       </plugin>

       <!-- Plugin to create a single jar that includes all dependencies -->
       <plugin>
         <artifactId>maven-assembly-plugin</artifactId>
         <version>2.4</version>
         <configuration>
           <descriptorRefs>
             <descriptorRef>jar-with-dependencies</descriptorRef>
           </descriptorRefs>
         </configuration>
         <executions>
           <execution>
             <id>make-assembly</id>
             <phase>package</phase>
             <goals>
               <goal>single</goal>
             </goals>
           </execution>
         </executions>
       </plugin>

     </plugins>
   </build>

   <repositories>

     <!-- needed for cdh build of Spark -->
     <repository>
       <id>releases</id>
       <url>10.10.1.29:8081/nexus/content/repositories/releases</url>
     </repository>

     <repository>
       <id>cloudera</id>
       <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
     </repository>

   </repositories>

   <dependencies>

     <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-library</artifactId>
       <version>${scala.version}</version>
     </dependency>

     <!-- on node29 -->
     <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-assembly</artifactId>
       <version>1.0.0-cdh3u4</version>
       <classifier>cdh3u4</classifier>
     </dependency>

     <!-- spark docs says I need hadoop-client, cdh3u3 repo no longer exists -->
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>0.20.2-cdh3u4</version>
     </dependency>

   </dependencies>
 </project>


 Here's what

Spark 1.0: slf4j version conflicts with pig

2014-05-27 Thread Ryan Compton
I use both Pig and Spark. All my code is built with Maven into a giant
*-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now
all my pig scripts fail with:

Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)


Did Spark 1.0 change the version of slf4j? I can't seem to find it via
mvn dependency:tree
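
One way to see which slf4j actually wins inside the assembled jar is to ask the
JVM at runtime. A minimal sketch using only the standard slf4j API (the object
name is illustrative):

// Print where org.slf4j.Logger was loaded from and which binding is active.
object Slf4jCheck {
  def main(args: Array[String]): Unit = {
    val src = classOf[org.slf4j.Logger].getProtectionDomain.getCodeSource
    println(if (src == null) "bootstrap classpath" else src.getLocation.toString)
    println(org.slf4j.LoggerFactory.getILoggerFactory.getClass.getName)
  }
}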


Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
Whoops, I should have mentioned that it's a multivariate median (cf
http://www.pnas.org/content/97/4/1423.full.pdf ). It's easy to compute
when all the values are accessible at once. I'm not sure it's possible
with a combiner. So, I guess the question should be: Can I use
GraphX's Pregel without a combiner?

On Wed, Apr 23, 2014 at 7:01 PM, Tom Vacek minnesota...@gmail.com wrote:
 Here are some out-of-the-box ideas:  If the elements lie in a fairly small
 range and/or you're willing to work with limited precision, you could use
 counting sort.  Moreover, you could iteratively find the median using
 bisection, which would be associative and commutative.  It's easy to think
 of improvements that would make this approach give a reasonable answer in a
 few iterations.  I have no idea about mixing algorithmic iterations with
 median-finding iterations.


 On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton compton.r...@gmail.com
 wrote:

 I'm trying to shoehorn a label propagation-ish algorithm into GraphX. I
 need to update each vertex with the median value of their neighbors.
 Unlike PageRank, which updates each vertex with the mean of their
 neighbors, I don't have a simple commutative and associative function
 to use for mergeMsg.

 What are my options? It looks like I can choose between:

 1. a hacky mergeMsg (i.e. combine (a, b) => Array(a, b) and then take the
 median in vprog)
 2. collectNeighbors and then median (sketched below)
 3. ignore GraphX and just do the whole thing with joins (which I
 actually got working, but it's slow)

 Is there another possibility that I'm missing?
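
For what it's worth, option 2 is straightforward to sketch for a scalar median;
the multivariate (geometric) median would replace the median helper with an
iterative solver such as Weiszfeld's algorithm. A minimal sketch (types and
names are illustrative):

import org.apache.spark.graphx.{EdgeDirection, Graph}

// Option 2: gather each vertex's neighbor values with collectNeighbors,
// then join the medians back onto the graph; no mergeMsg combiner needed.
def medianUpdate(g: Graph[Double, Int]): Graph[Double, Int] = {
  def median(xs: Array[Double]): Double = {
    val s = xs.sorted
    val n = s.length
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }
  val neighborVals = g.collectNeighbors(EdgeDirection.Either).mapValues(_.map(_._2))
  g.outerJoinVertices(neighborVals) { (_, oldVal, nbrs) =>
    nbrs.filter(_.nonEmpty).map(median).getOrElse(oldVal)
  }
}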




GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like

394365859 -- 136153151
589404147 -- 1361045425

I read it into a Graph via:

val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName)
val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
   .map(x => (x(0).toLong, x(2).toLong))
val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
  uniqueEdges = Option(CanonicalRandomVertexCut))

Now, edgeTupRDD.distinct().count() tells me I have 240086 distinct
lines in the file, g.numEdges tells me they combined into 240096
weighted edges (which is really weird since that's more lines than in
the RDD), but g.edges.distinct().count() tells me I have 10. Why?


Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt

This code:

println(g.numEdges)
println(g.numVertices)
println(g.edges.distinct().count())

gave me

1
9294
2



On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
 I wasn't able to reproduce this with a small test file, but I did change the
 file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take
 the third column rather than the second?

 If so, would you mind posting a larger sample of the file, or even the whole
 file if possible?

 Here's the test that succeeded:

   test("graph.edges.distinct.count") {
     withSpark { sc =>
       val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
         "394365859\t136153151", "589404147\t1361045425"))
       val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
         .map(x => (x(0).toLong, x(1).toLong))
       val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
         uniqueEdges = Option(CanonicalRandomVertexCut))
       assert(edgeTupRDD.distinct.count() === 2)
       assert(g.numEdges === 2)
       assert(g.edges.distinct.count() === 2)
     }
   }

 Ankur


Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now)

When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
I'm guessing about 500M are distinct) my jobs fail with:

14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to
java.io.FileNotFoundException
java.io.FileNotFoundException:
/tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many
open files)

ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a
worker node during a failed .distinct():
http://i.imgur.com/wyBHmzz.png , you can see tasks fail when Spark
tries to open 32000 files.

I never ran into this in 0.7.3. Is there a parameter I can set to tell
Spark to use less than 32000 files?
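
For what it's worth, the number of shuffle files scales with the reduce-side
partition count, and distinct() accepts an explicit partition count. A minimal
sketch (the path and the partition number are illustrative, not a recommendation):

import org.apache.spark.SparkContext

// distinct() defaults to the parent RDD's partition count; passing a smaller,
// explicit count reduces how many shuffle files each worker holds open at once
// (at the cost of larger shuffle partitions).
def countDistinct(sc: SparkContext, path: String): Long = {
  val rows = sc.textFile(path)
  rows.distinct(2000).count()
}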

On Mon, Mar 24, 2014 at 10:23 AM, Aaron Davidson ilike...@gmail.com wrote:
 Look up setting ulimit, though note the distinction between soft and hard
 limits, and that updating your hard limit may require changing
 /etc/security/limits.conf and restarting each worker.


 On Mon, Mar 24, 2014 at 1:39 AM, Kane kane.ist...@gmail.com wrote:

  Got a bit further; I think the out-of-memory error was caused by setting
  spark.spill to false. Now I have this error. Is there an easy way to
  increase the file limit for Spark, cluster-wide?

 java.io.FileNotFoundException:

 /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2
 (Too many open files)
 at java.io.FileOutputStream.openAppend(Native Method)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
 at

 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
 at

 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
 at

 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
 at

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
 at
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
 at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
 at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
 at

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
 at

 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 at

 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
"true") and use ext3 (CentOS...)

On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton compton.r...@gmail.com wrote:
 Does this continue in newer versions? (I'm on 0.8.0 now)

 When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
 I'm guessing about 500M are distinct) my jobs fail with:

 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to
 java.io.FileNotFoundException
 java.io.FileNotFoundException:
 /tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many
 open files)

 ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a
 worker node during a failed .distinct():
 http://i.imgur.com/wyBHmzz.png , you can see tasks fail when Spark
 tries to open 32000 files.

 I never ran into this in 0.7.3. Is there a parameter I can set to tell
 Spark to use less than 32000 files?

 On Mon, Mar 24, 2014 at 10:23 AM, Aaron Davidson ilike...@gmail.com wrote:
 Look up setting ulimit, though note the distinction between soft and hard
 limits, and that updating your hard limit may require changing
  /etc/security/limits.conf and restarting each worker.


 On Mon, Mar 24, 2014 at 1:39 AM, Kane kane.ist...@gmail.com wrote:

  Got a bit further; I think the out-of-memory error was caused by setting
  spark.spill to false. Now I have this error. Is there an easy way to
  increase the file limit for Spark, cluster-wide?

 java.io.FileNotFoundException:

 /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2
 (Too many open files)
 at java.io.FileOutputStream.openAppend(Native Method)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
 at

 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
 at

 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
 at

 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
 at

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
 at
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
 at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
 at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
 at

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
 at

 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 at

 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




All pairs shortest paths?

2014-03-26 Thread Ryan Compton
No idea how feasible this is. Has anyone done it?


Re: All pairs shortest paths?

2014-03-26 Thread Ryan Compton
To clarify: I don't need the actual paths, just the distances.

On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton compton.r...@gmail.com wrote:
 No idea how feasible this is. Has anyone done it?
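
For the distances-only case, later GraphX releases ship
org.apache.spark.graphx.lib.ShortestPaths, which returns hop counts to a set of
landmark vertices. Passing every vertex as a landmark gives all-pairs distances,
though each vertex then carries a map with one entry per landmark, so it only
scales to small graphs. A minimal sketch, assuming a GraphX version that
includes ShortestPaths:

import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

// Hop-count distances from every vertex to every reachable vertex.
// Each vertex ends up with a Map(landmarkId -> distance), so per-vertex
// memory grows with |V|; only feasible for small graphs.
def allPairsDistances[VD, ED: ClassTag](g: Graph[VD, ED]): Graph[ShortestPaths.SPMap, ED] = {
  val landmarks: Seq[VertexId] = g.vertices.map(_._1).collect().toSeq
  ShortestPaths.run(g, landmarks)
}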