Spark's Behavior 2
Hi TD, I have sent more information now, using 8 workers. The gap is now 27 seconds. Have you seen it? Thanks BR
Re: Accuracy in mllib BinaryClassificationMetrics
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui On Mon, May 12, 2014 at 1:26 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I see precision and recall but no accuracy in mllib.evaluation.binary. Is it already under development, or does it need to be added? Thanks. Deb
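For anyone who needs it in the meantime, here is a minimal sketch (not the mllib API, just plain RDD operations; the names and sample data are illustrative) of computing accuracy by hand from an RDD of (prediction, label) pairs:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("accuracy").setMaster("local[*]"))
    // (prediction, label) pairs; replace with your model's output
    val predictionAndLabels = sc.parallelize(Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)))
    // accuracy = fraction of examples where prediction == label
    val accuracy = predictionAndLabels.filter { case (p, l) => p == l }.count().toDouble /
      predictionAndLabels.count()
    println(s"accuracy = $accuracy") // 0.5 for the sample pairs above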
Re: Dead lock running multiple Spark jobs on Mesos
Are you setting a core limit with spark.cores.max? If you don't, then in coarse-grained mode each Spark job uses all available cores on Mesos and doesn't release them until the job terminates, at which point the other job can access the cores. https://spark.apache.org/docs/latest/running-on-mesos.html -- see the Mesos Run Modes section. The quick fix should be to set spark.cores.max to half of your cluster's cores so that two jobs can run concurrently. Alternatively, switching to fine-grained mode would help here too, at the expense of higher latency on startup. On Mon, May 12, 2014 at 12:37 PM, Martin Weindel martin.wein...@gmail.com wrote: I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0. If I run a single Spark job, it runs fine on Mesos. Running multiple Spark jobs also works if I'm using the coarse-grained mode (spark.mesos.coarse = true). But if I run two Spark jobs in parallel using the fine-grained mode, the jobs seem to block each other after a few seconds, and the Mesos UI reports no idle but also no used CPUs in this state. As soon as I kill one job, the other continues normally. See below for some log output. It looks to me as if something strange happens with the CPU resources. Can anybody give me a hint about the cause? The jobs read some HDFS files but have no other communication with external processes. Or any other suggestions on how to analyze this problem? Thanks, Martin

Here is the relevant log output of the driver of job 1:
INFO 17:53:09,247 Missing parents for Stage 2: List()
INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now runnable
INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2 (MapPartitionsRDD[9] at mapPartitions at HighTemperatureSpansPerLogfile.java:92)
INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
*** at this point the job was killed ***

Log output of the driver of job 2:
INFO 17:53:04,874 Missing parents for Stage 6: List()
INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at ComputeLogFileTimespan.java:71), which is now runnable
INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23] at values at ComputeLogFileTimespan.java:71)
INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
*** at this point job 1 was killed ***
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor 20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to spark@myclusternode:40542
INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes
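For illustration, a minimal sketch of the suggested fix (the master URL and the 16-core cluster size are assumptions; adjust to your setup):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("job1")
      .setMaster("mesos://master:5050")  // placeholder Mesos master URL
      .set("spark.mesos.coarse", "true") // coarse-grained mode
      .set("spark.cores.max", "8")       // half of an assumed 16-core cluster,
                                         // leaving room for a second concurrent job
    val sc = new SparkContext(conf)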
something about pipeline
Dear all, here is the definition of fetch wait time: "Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to be blocking on block B." By this definition, can I conclude that tasks pipeline block fetching with the real work? How does Spark decide that a task can be split by blocks to do the pipelining? If the task is something like:

val b = a.mapPartitions { itr =>
  timeStamp
  val arr = itr.toArray
  ...
  timeStamp
  arr.toIterator
}

can fetching blocks of RDD a and processing RDD b be pipelined? Here is the information for my task: Launch Time: 1399882225433; Finish Time: 1399882252948; Executor Run Time: 27497; Shuffle Finish Time: 1399882246138; Fetch Wait Time: 9377; time spent in a.mapPartitions: 8287 (call it mapPartitions time). Finish Time - Launch Time = 27515. Shuffle Finish Time - Launch Time = 20705 (call it total shuffle time). Executor Run Time - total shuffle time = 6792. Total shuffle time is 20705 and Fetch Wait Time is 9377, so during the remaining 20705 - 9377 = 11328 the task is doing other work. What is it doing? The mapPartitions? Or is mapPartitions executed after the shuffle completes? But the calculated times do not match. I'm so confused; I need your help!
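For reference, a runnable version of the snippet above (assuming an existing SparkContext sc; sample data stands in for the real RDD a, and the timestamps are the ones the question refers to):

    val a = sc.parallelize(1 to 1000000, 8) // stand-in for the real RDD a
    val b = a.mapPartitions { itr =>
      val start = System.currentTimeMillis()           // first timestamp
      val arr = itr.toArray                            // drains the partition's input here
      // ... the real per-partition work would go here ...
      val elapsed = System.currentTimeMillis() - start // second timestamp
      println(s"partition processed in $elapsed ms")
      arr.toIterator
    }
    b.count()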
Re: How to read a multipart s3 file?
Thanks Nicholas! I looked at those docs several times without noticing that critical part you highlighted.
Re: java.lang.StackOverflowError when calling count()
We are running into the same issue. After 700 or so files the stack overflows; cache, persist, and checkpointing don't help. Basically, checkpointing only saves the RDD when it is materialized, and it only materializes at the end, by which point the stack has already overflowed. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng men...@gmail.com wrote: You have a long lineage that causes the StackOverflowError. Try rdd.checkpoint() and rdd.count() every 20~30 iterations. checkpoint can cut the lineage. -Xiangrui On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote: Dear Sparkers: I am using PySpark 0.9.0 to implement an iterative algorithm. I got the errors shown at the end of this email. It seems to be due to a Java StackOverflowError. The same error has been reproduced on a Mac desktop and a Linux workstation, both running the same version of Spark. The same line of code works correctly for quite a few iterations before failing. At the line of the error, rdd__new.count() could be 0. (In some previous rounds, this was also 0 without any problem.) Any thoughts on this? Thank you very much, - Guanhua CODE: print round, round, rdd__new.count() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 542, in count 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to java.lang.StackOverflowError [duplicate 1] return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times; aborting job File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 533, in sum 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state FAILED from TID 1774 because its task set is gone return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 499, in reduce vals = self.mapPartitions(func).collect() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 463, in collect bytesInJava = self._jrdd.collect().iterator() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect. 
: org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times (most recent failure: Exception failure: java.lang.StackOverflowError) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) == The stack overflow error is shown as follows: == 14/05/12 16:20:28 ERROR Executor: Exception in task ID 1774 java.lang.StackOverflowError at java.util.zip.Inflater.inflate(Inflater.java:259) at
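For illustration, a minimal sketch of Xiangrui's suggestion (the checkpoint directory, sample data, and update function are placeholders; it assumes an existing SparkContext sc):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder directory
    var rdd = sc.parallelize(1 to 100000).map(_.toDouble)
    for (i <- 1 to 200) {
      rdd = rdd.map(_ * 1.001)  // stand-in for the real iterative update
      if (i % 25 == 0) {        // every 20~30 iterations
        rdd.checkpoint()        // mark the RDD for checkpointing (cuts the lineage)
        rdd.count()             // force materialization so the checkpoint is written
      }
    }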
Re: Doubts regarding Shark
The table will be cached, but 10 GB (most likely more) will be on disk. You can check that in the Storage tab of the Shark application UI. The Java out-of-memory error could occur because your worker memory is too low or the memory allocated to Shark is too low. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, May 8, 2014 at 12:42 AM, vinay Bajaj vbajaj2...@gmail.com wrote: Hello, I have a few questions regarding Shark. 1) I have a table of 60 GB and total memory of 50 GB, but when I try to cache the table it gets cached successfully. How does Shark cache the table when there is not enough memory to hold it entirely in memory? And how do the cache eviction policies (FIFO and LRU) work while caching the table? While creating tables I am using the cache type property MEMORY (storage level: memory and disk). 2) Sometimes while running queries I get a JavaOutOfMemory exception even though all tables are cached successfully. Can you tell me the cases, or give an example, in which that error can occur? Regards Vinay Bajaj
Re: Caching in graphX
Unfortunately it's very difficult to get uncaching right with GraphX due to the complicated internal dependency structure that it creates. It's necessary to know exactly what operations you're doing on the graph in order to unpersist correctly (i.e., in a way that avoids recomputation). I have a pull request (https://github.com/apache/spark/pull/497) that may make this a bit easier, but your best option is to use the Pregel API for iterative algorithms if possible. If that's not possible, leaving things cached has actually not been very costly in my experience, at least as long as VD and ED are primitive types to reduce the load on the garbage collector. Ankur
Re: Any ideas on an architecture based on Spark + Spray + Akka
We are using the Spray + Akka + Spark stack at Alpine Data Labs. Chester Sent from my iPhone On May 4, 2014, at 8:37 PM, ZhangYi yizh...@thoughtworks.com wrote: Hi all, Currently, our project is planning to adopt Spark as its big data platform. For the client side, we decided to expose a REST API based on Spray. Our domain is the communications field, processing data analysis and statistics for 3G and 4G users. Spark + Spray is brand new for us, and we can't find any best practices via Google. In our opinion, an event-driven architecture may be a good choice for our project, but more ideas are welcome. Thanks. -- ZhangYi (张逸) Developer tel: 15023157626 blog: agiledon.github.com weibo: tw张逸 Sent with Sparrow
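For anyone searching later, here is a minimal sketch of this kind of wiring, assuming spray 1.x's SimpleRoutingApp (the route, port, and job are illustrative, and sharing one long-lived SparkContext across requests is one design choice, not the only option):

    import akka.actor.ActorSystem
    import org.apache.spark.{SparkConf, SparkContext}
    import spray.routing.SimpleRoutingApp

    object RestSparkServer extends App with SimpleRoutingApp {
      implicit val system = ActorSystem("rest-spark")
      // one long-lived SparkContext shared by all REST requests
      val sc = new SparkContext(new SparkConf().setAppName("rest-demo").setMaster("local[*]"))

      startServer(interface = "localhost", port = 8080) {
        path("count") {
          get {
            complete {
              // a tiny stand-in Spark job, run per request
              sc.parallelize(1 to 1000).count().toString
            }
          }
        }
      }
    }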
no subject
unsubscribe
Re: Is there any problem on the spark mailing list?
I think so; there have been fewer questions and answers these past three days.
Re: Variables outside of mapPartitions scope
In general, you can find out exactly what's not serializable by adding -Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS. Since a "this" reference to the enclosing class is often what's causing the problem, a general workaround is to move the mapPartitions call to a static method where there is no "this" reference. This transforms this:

class A { def f() = rdd.mapPartitions(iter => ...) }

into this:

class A { def f() = A.helper(rdd) }
object A { def helper(rdd: RDD[...]) = rdd.mapPartitions(iter => ...) }
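Spelled out as a compilable sketch (Int elements and a trivial transformation stand in for the real logic):

    import org.apache.spark.rdd.RDD

    // Before: the closure created in f() captures `this` (an instance of A),
    // so the whole class must be serializable.
    class A(rdd: RDD[Int]) {
      def f(): RDD[Int] = A.helper(rdd) // delegate to the companion object instead
    }

    // After: the closure is built in a static context, so no `this` is captured.
    object A {
      def helper(rdd: RDD[Int]): RDD[Int] =
        rdd.mapPartitions(iter => iter.map(_ + 1))
    }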
Re: A new resource for getting examples of Spark RDD API calls
Hi Zhen, Thanks a lot for sharing. I'm sure it will be useful for new users. A small note on the 'checkpoint' explanation: for sc.setCheckpointDir(my_directory_name), it would be useful to specify that 'my_directory_name' should exist on all slaves. As an alternative, you could use an HDFS directory URL as well. I've seen people tripping on that a few times. -kr, Gerard. On Fri, May 9, 2014 at 11:54 PM, zhen z...@latrobe.edu.au wrote: Hi Everyone, I found it quite difficult to find good examples for the Spark RDD API calls. So my student and I decided to go through the entire API and write examples for the vast majority of API calls (basically, examples for anything that is remotely interesting). I think these examples may be useful to other people, so I have put them up on my web site. There is also a PDF version that you can download from the web site. http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html Please let me know if you find any errors in them, or any better examples you would like me to add. Hope you find it useful. Zhen
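In code, Gerard's point reads like this (paths are illustrative; sc is an existing SparkContext):

    // Visible to every worker: safe on a cluster.
    sc.setCheckpointDir("hdfs://namenode:8020/user/me/checkpoints")
    // A node-local path is only safe in local mode, or if the directory
    // already exists on every slave:
    // sc.setCheckpointDir("/tmp/checkpoints")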
Caching in graphX
Hi, I'm writing this post because I would like to know a caching approach for iterative algorithms in GraphX. So far I have not been able to keep the execution time of each iteration stable. Can you achieve this condition? The code I used is this:

var g = ... // my graph
var prevG: Graph[VD, ED] = null
var i = 0
while (i < maxIter) {
  prevG = g
  g = g.foo()
  g = g.foo1()
  g = g.fooN()
  g.cache
  g.vertices.count + g.edges.count
  prevG.edges.unpersist()
  prevG.vertices.unpersist()
  i += 1
}
Re: A new resource for getting examples of Spark RDD API calls
Great work! Thanks! On May 13, 2014 3:16 AM, zhen z...@latrobe.edu.au wrote: Hi Everyone, I found it quite difficult to find good examples for the Spark RDD API calls. So my student and I decided to go through the entire API and write examples for the vast majority of API calls (basically, examples for anything that is remotely interesting). I think these examples may be useful to other people, so I have put them up on my web site. There is also a PDF version that you can download from the web site. http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html Please let me know if you find any errors in them, or any better examples you would like me to add. Hope you find it useful. Zhen
Re: Variables outside of mapPartitions scope
Scala's for-loop is not just looping; it's not a native loop at the bytecode level. It will create a couple of objects at runtime and perform a truckload of method calls on them. As a result, if you refer to variables outside the for-loop, the whole for-loop object and any variable inside the loop have to be serializable. Since the for-loop itself is serializable in Scala, I guess you have something non-serializable inside the for-loop. The while-loop in Scala is native, so you won't have this issue if you use a while-loop. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, May 9, 2014 at 1:13 PM, pedro ski.rodrig...@gmail.com wrote: Right now I am not using any class variables (references to this). All my variables are created within the scope of the method I am running. I did more debugging and found this strange behavior. This will result in a serialization bug:

variables here
for loop
  mapPartitions call
  use variables here
  end mapPartitions
endfor

but this won't:

variables here
for loop
  create new references to variables here
  mapPartitions call
  use new reference variables here
  end mapPartitions
endfor
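A minimal, self-contained sketch of the desugaring DB Tsai describes:

    val xs = Seq(1, 2, 3)
    def f(x: Int): Unit = println(x)

    for (x <- xs) f(x)      // the compiler rewrites this into a method call
    xs.foreach(x => f(x))   // taking a closure object; that object (and whatever
                            // it captures) must be serializable inside a Spark task

    var i = 0               // the while-loop compiles to plain bytecode jumps
    while (i < xs.length) { // and creates no extra objects
      f(xs(i))
      i += 1
    }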
Re: Turn BLAS on MacOSX
Hi wxhsdp, See https://github.com/scalanlp/breeze/issues/142 and https://github.com/fommil/netlib-java/issues/60 for details. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13, 2014 at 2:17 AM, wxhsdp wxh...@gmail.com wrote: Hi Xiangrui, I compiled OpenBLAS on an EC2 m1.large; when Breeze calls the native lib, an error occurs: INFO: successfully loaded /mnt2/wxhsdp/libopenblas/lib/libopenblas_nehalemp-r0.2.9.rc2.so [error] (run-main-0) java.lang.UnsatisfiedLinkError: com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Ljava/lang/String;Ljava/lang/String;IIID[DII[DIID[DII)V java.lang.UnsatisfiedLinkError: com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Ljava/lang/String;Ljava/lang/String;IIID[DII[DIID[DII)V at com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Native Method) at com.github.fommil.netlib.NativeSystemBLAS.dgemm(NativeSystemBLAS.java:100)
Re: Reading from .bz2 files with Spark
Which Hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to the bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in HDFS from Spark with success? I'm finding the following results on a recent commit (756c96 from 24hr ago) and CDH 4.4.0:

Works: val r = sc.textFile("/user/aa/myfile.bz2").count
Doesn't work: val r = sc.textFile("/user/aa/myfile.bz2").map((s: String) => s + "|").count

Specifically, I'm getting an exception coming out of the bzip2 libraries (see stacktraces below), which is unusual because I'm able to read from that file without an issue using the same libraries via Pig. It was originally created from Pig as well. Digging a little deeper, I found this line in the .bz2 decompressor's javadoc for CBZip2InputStream: "Instances of this class are not threadsafe." [source] My current working theory is that Spark has a much higher level of parallelism than Pig/Hadoop does, and thus I hit these wild IndexOutOfBounds exceptions much more frequently (as in, I can't finish a run over a little 2M-row file) versus hardly at all in other libraries. The only other reference I could find to the issue was in presto-users, but the recommendation to leave .bz2 for .lzo doesn't help if I actually do want the higher compression levels of .bz2. Would love to hear whether I have some kind of configuration issue or whether there's a bug in the .bz2 handling that's fixed in later versions of CDH, or generally any other thoughts on the issue. Thanks! Andrew Below are examples of some exceptions I'm getting: 14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 65535 at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397) at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426) at java.io.InputStream.read(InputStream.java:101) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43) java.lang.ArrayIndexOutOfBoundsException: 90 at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:900) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333) at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397) at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426) at 
java.io.InputStream.read(InputStream.java:101) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$countPartition$1(RDD.scala:868)
Re: 1.0.0 Release Date?
Hi All, We are also waiting for this. Does anyone know of a tentative date for this release? We are on Spark 0.8.0 right now. Should we wait for Spark 1.0 or upgrade to Spark 0.9.1? Thanks, Anurag Tangri On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote: Can anyone comment on the anticipated date, or worst-case timeframe, for when Spark 1.0.0 will be released?
Turn BLAS on MacOSX
Hi, How do I load native BLAS libraries on Mac? I am getting the following errors while running LR and SVM with SGD: 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS On CentOS it was fine, but on Mac I am getting these warnings. Also, when it fails to load native BLAS, does it fall back to Java code for the BLAS operations? Maybe, after the addition of Breeze, we should add these details on a page as well so that users are aware of them before they report any performance results. Thanks. Deb
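For reference, one commonly suggested route (an assumption here; see the netlib-java and Breeze links in the reply above for authoritative details) is to add netlib-java's native bundle to the application's sbt build:

    // build.sbt fragment; the version is illustrative, and since "all" is a
    // pom-packaged artifact, sbt may need the pomOnly() qualifier
    libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()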
Re: Spark to utilize HDFS's mmap caching
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote: That API is something the HDFS administrator uses outside of any application to tell HDFS to cache certain files or directories. But once you’ve done that, any existing HDFS client accesses them directly from the cache. Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use that facility for adding files to the cache or anything like that. But yes, it does benefit from things already cached. On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote: Is that true? I believe that API Chanwit is talking about requires explicitly asking for files to be cached in HDFS. Spark automatically benefits from the kernel's page cache (i.e. if some block is in the kernel's page cache, it will be read more quickly). But the explicit HDFS cache is a different thing; Spark applications that want to use it would have to explicitly call the respective HDFS APIs. On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yes, Spark goes through the standard HDFS client and will automatically benefit from this. Matei On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via sc.textFile() and other HDFS-related APIs? http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit -- Marcelo -- Marcelo
Re: java.lang.StackOverflowError when calling count()
Thanks Xiangrui. After some debugging, it turns out that the problem resulted from a bug in my code. But it's good to know that a long lineage can also lead to this problem. I will also try checkpointing to see whether the performance can be improved. Best regards, - Guanhua On 5/13/14 12:10 AM, Xiangrui Meng men...@gmail.com wrote: You have a long lineage that causes the StackOverflowError. Try rdd.checkpoint() and rdd.count() every 20~30 iterations. checkpoint can cut the lineage. -Xiangrui
Re: java.lang.StackOverflowError when calling count()
Count causes the overall performance to drop drastically; in fact, beyond 50 files it starts to hang if I force materialization. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, May 13, 2014 at 9:34 PM, Xiangrui Meng men...@gmail.com wrote: After checkpoint(), call count() directly to materialize it. -Xiangrui
Re: Spark to utilize HDFS's mmap caching
Great to know that! Thank you, Matei. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia matei.zaha...@gmail.com wrote: That API is something the HDFS administrator uses outside of any application to tell HDFS to cache certain files or directories. But once you've done that, any existing HDFS client accesses them directly from the cache. Matei