Spark's Behavior 2

2014-05-13 Thread Eduardo Costa Alfaia
Hi TD,

I have sent more information now, using 8 workers. The gap is now 27 seconds.
Have you seen it?
Thanks

BR
-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Accuracy in mllib BinaryClassificationMetrics

2014-05-13 Thread Xiangrui Meng
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui

On Mon, May 12, 2014 at 1:26 PM, Debasish Das debasish.da...@gmail.com wrote:
 Hi,

 I see precision and recall but no accuracy in mllib.evaluation.binary.

 Is it already under development, or does it need to be added?

 Thanks.
 Deb
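
For reference, accuracy comes from the same confusion-matrix counts that
precision and recall are built on. A minimal sketch of the definitions
(tp, fp, tn, fn are hypothetical counts, not the MLlib API):

// Plain confusion-matrix definitions; not MLlib code.
def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)
def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)
def accuracy(tp: Long, fp: Long, tn: Long, fn: Long): Double =
  (tp + tn).toDouble / (tp + fp + tn + fn)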



Re: Dead lock running multiple Spark jobs on Mesos

2014-05-13 Thread Andrew Ash
Are you setting a core limit with spark.cores.max?  If you don't, in coarse
mode each Spark job uses all available cores on Mesos and doesn't release
them until the job is terminated, at which point the other job can access
the cores.

https://spark.apache.org/docs/latest/running-on-mesos.html -- Mesos Run
Modes section

The quick fix should be to set spark.cores.max to half of your cluster's
cores to support running two jobs concurrently.  Alternatively, switching
to fine-grained mode would help here too at the expense of higher latency
on startup.
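
For example, a minimal sketch of the quick fix (the master URL, app name, and
cluster size are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical 16-core Mesos cluster shared by two jobs.
val conf = new SparkConf()
  .setMaster("mesos://zk://zkhost:2181/mesos") // placeholder master URL
  .setAppName("JobOne")
  .set("spark.cores.max", "8")                 // claim at most half the cores
val sc = new SparkContext(conf)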



On Mon, May 12, 2014 at 12:37 PM, Martin Weindel
martin.wein...@gmail.comwrote:

  I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos
 0.17.0.

 If I run a single Spark Job, the job runs fine on Mesos. Running multiple
 Spark Jobs also works, if I'm using the coarse-grained mode
 (spark.mesos.coarse = true).

 But if I run two Spark jobs in parallel using the fine-grained mode, the
 jobs seem to block each other after a few seconds.
 In this state, the Mesos UI reports no idle CPUs, but also no used CPUs.

 As soon as I kill one job, the other continues normally. See below for
 some log output.
 It looks to me as if something strange happens with the CPU resources.

 Can anybody give me a hint about the cause? The jobs read some HDFS files
 but have no other communication with external processes.
 Or do you have any other suggestions for how to analyze this problem?

 Thanks,

 Martin

 -
 Here is the relevant log output of the driver of job1:

 INFO 17:53:09,247 Missing parents for Stage 2: List()
  INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at
 mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now
 runnable
  INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
 (MapPartitionsRDD[9] at mapPartitions at
 HighTemperatureSpansPerLogfile.java:92)
  INFO 17:53:09,269 Adding task set 2.0 with 1 tasks

 

 *** at this point the job was killed ***


 log output of driver of job2:
  INFO 17:53:04,874 Missing parents for Stage 6: List()
  INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
 ComputeLogFileTimespan.java:71), which is now runnable
  INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23]
 at values at ComputeLogFileTimespan.java:71)
  INFO 17:53:04,882 Adding task set 6.0 with 1 tasks

 

 *** at this point the job 1 was killed ***
 INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
 20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
  INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
  INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
 spark@myclusternode:40542

  INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes



something about pipeline

2014-05-13 Thread wxhsdp
Dear all,

   definition of fetch wait time:
   * Time the task spent waiting for remote shuffle blocks. This only
includes the time
   * blocking on shuffle input data. For instance if block B is being
fetched while the task is
   * still not finished processing block A, it is not considered to be
blocking on block B.

   By the definition of fetch wait time, can I conclude that tasks pipeline
block fetching with the real work? How does Spark decide that a task can be
split by blocks to do this pipelining?

  if the task is something like:

  val b = a.mapPartitions { itr =>
    // take a timestamp here
    val arr = itr.toArray
    ...
    // take a timestamp here
    arr.toIterator
  }

  Can fetching the blocks of RDD a and processing RDD b be pipelined?

here's the information of my task:
Launch Time:1399882225433
Finish Time:  1399882252948
Executor Run Time:27497
Shuffle Finish Time:1399882246138
Fetch Wait Time:9377
task time inside a.mapPartitions is 8287 ms (call it the mapPartitions time)

Finish Time - Launch Time = 27515
Shuffle Finish Time - Launch Time = 20705 (call it the total shuffle time)
Executor Run Time - total shuffle time = 6792

The total shuffle time is 20705 ms while the Fetch Wait Time is only 9377 ms,
so during the remaining 20705 - 9377 = 11328 ms the task is doing other work.
What is it doing? The mapPartitions? Or is mapPartitions executed only after
the shuffle completes? But then the calculated times do not match. I'm
confused and need your help!







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/something-about-pipeline-tp5626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to read a multipart s3 file?

2014-05-13 Thread kamatsuoka
Thanks Nicholas!  I looked at those docs several times without noticing that
critical part you highlighted.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5494.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
We are running into the same issue. After 700 or so files the stack overflows;
cache, persist, and checkpointing don't help.
Basically, checkpointing only saves the RDD when it is materialized, and it
only materializes at the end, so it runs out of stack first.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng men...@gmail.com wrote:

 You have a long lineage that causes the StackOverflow error. Try
 rdd.checkpoint() and rdd.count() every 20~30 iterations.
 checkpoint() can cut the lineage. -Xiangrui
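
 To illustrate, a minimal sketch of that pattern (the checkpoint directory
 and the per-iteration transformation are made up):

 sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical directory
 var rdd = sc.parallelize(1 to 1000000)
 for (i <- 1 to 100) {
   rdd = rdd.map(_ + 1) // stand-in for the real per-iteration transformation
   if (i % 25 == 0) {   // every 20~30 iterations, as suggested above
     rdd.checkpoint()   // mark the RDD for checkpointing, truncating the lineage
     rdd.count()        // force materialization so the checkpoint is written
   }
 }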

 On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote:
  Dear Sparkers:
 
  I am using Python spark of version 0.9.0 to implement some iterative
  algorithm. I got some errors shown at the end of this email. It seems
 that
  it's due to the Java Stack Overflow error. The same error has been
  duplicated on a mac desktop and a linux workstation, both running the
 same
  version of Spark.
 
  The same line of code works correctly after quite some iterations. At the
  line of error, rdd__new.count() could be 0. (In some previous rounds,
 this
  was also 0 without any problem).
 
  Any thoughts on this?
 
  Thank you very much,
  - Guanhua
 
 
  
  CODE: print "round", round, rdd__new.count()
  
    File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 542, in count
  14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
  java.lang.StackOverflowError [duplicate 1]
  return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
  aborting job
    File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 533, in sum
  14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state FAILED
  from TID 1774 because its task set is gone
  return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
    File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 499, in reduce
  vals = self.mapPartitions(func).collect()
    File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 463, in collect
  bytesInJava = self._jrdd.collect().iterator()
    File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
    File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
  py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect.
  : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times
  (most recent failure: Exception failure: java.lang.StackOverflowError)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

  ==
  The stack overflow error is shown as follows:
  ==

  14/05/12 16:20:28 ERROR Executor: Exception in task ID 1774
  java.lang.StackOverflowError
  at java.util.zip.Inflater.inflate(Inflater.java:259)
  at 

Re: Doubts regarding Shark

2014-05-13 Thread Mayur Rustagi
The table will be cached, but 10 GB (most likely more) will be on disk. You
can check that in the Storage tab of the Shark application.

The Java out-of-memory error could occur because your worker memory is too
low or the memory allocated to Shark is too low.


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, May 8, 2014 at 12:42 AM, vinay Bajaj vbajaj2...@gmail.com wrote:

 Hello

 I have few questions regarding shark.

 1) I have a table of 60 GB and I have total memory of 50 GB, but when I try
 to cache the table it gets cached successfully. How does Shark cache the
 table when there is not enough memory to hold it all? And how do the cache
 eviction policies (FIFO and LRU) work while caching the table? While
 creating tables I am using the cache type property MEMORY (storage level:
 memory and disk).

 2) Sometimes while running queries I get a JavaOutOfMemory exception even
 though all tables are cached successfully. Can you tell me the cases, or
 give some examples, in which that error can occur?

 Regards
 Vinay Bajaj



Re: Caching in graphX

2014-05-13 Thread ankurdave
Unfortunately it's very difficult to get uncaching right with GraphX due to
the complicated internal dependency structure that it creates. It's
necessary to know exactly what operations you're doing on the graph in order
to unpersist correctly (i.e., in a way that avoids recomputation).

I have a pull request (https://github.com/apache/spark/pull/497) that may
make this a bit easier, but your best option is to use the Pregel API for
iterative algorithms if possible.

If that's not possible, leaving things cached has actually not been very
costly in my experience, at least as long as VD and ED are primitive types,
which reduces the load on the garbage collector.

Ankur
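
For reference, a minimal sketch (the max-propagation logic is purely
illustrative, and graph: Graph[Double, Int] is assumed to exist) of an
iterative computation expressed through the Pregel API:

import org.apache.spark.graphx._

val result = Pregel(graph, initialMsg = Double.NegativeInfinity, maxIterations = 10)(
  (id, attr, msg) => math.max(attr, msg),        // vertex program: keep the best value
  triplet =>                                     // send messages along improving edges
    if (triplet.srcAttr > triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr))
    else Iterator.empty,
  (a, b) => math.max(a, b)                       // merge concurrent messages
)

Pregel handles the per-iteration caching and message plumbing internally,
which is why it sidesteps the manual unpersist bookkeeping described above.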



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-in-graphX-tp5482p5514.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-13 Thread Chester At Yahoo
We are using spray + Akka + spark stack at Alpine data labs

Chester



Sent from my iPhone

 On May 4, 2014, at 8:37 PM, ZhangYi yizh...@thoughtworks.com wrote:
 
 Hi all,
 
 Currently, our project is planning to adopt Spark as its big data platform. 
 For the client side, we have decided to expose a REST API based on Spray. 
 Our domain is the communications field: processing data analysis and 
 statistics for 3G and 4G users. Spark + Spray is brand new for us, and we 
 can't find any best practices via Google. 
  
 In our opinion, an event-driven architecture may be a good choice for our 
 project. However, more ideas are welcome. Thanks.  
 
 -- 
 ZhangYi (张逸)
 Developer
 tel: 15023157626
 blog: agiledon.github.com
 weibo: tw张逸
 Sent with Sparrow
 


no subject

2014-05-13 Thread Herman, Matt (CORP)
unsubscribe

--
This message and any attachments are intended only for the use of the addressee 
and may contain information that is privileged and confidential. If the reader 
of the message is not the intended recipient or an authorized representative of 
the intended recipient, you are hereby notified that any dissemination of this 
communication is strictly prohibited. If you have received this communication 
in error, notify the sender immediately by return email and delete the message 
and any attachments from your system.


Re: Is there any problem on the spark mailing list?

2014-05-13 Thread wxhsdp
I think so; there have been fewer questions and answers these past three days.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Variables outside of mapPartitions scope

2014-05-13 Thread ankurdave
In general, you can find out exactly what's not serializable by adding
-Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.
Since a reference to the enclosing class (a captured "this") is often what's
causing the problem, a general workaround is to move the mapPartitions call
to a "static" method where there is no "this" reference. This transforms
this:

class A { def f() = rdd.mapPartitions(iter => ...) }

into this:

class A { def f() = A.helper(rdd) }

object A {
  def helper(rdd: RDD[...]) = rdd.mapPartitions(iter => ...)
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517p5527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Gerard Maas
Hi Zhen,

Thanks a lot for sharing. I'm sure it will be useful for new users.

A small note on the 'checkpoint' explanation:
sc.setCheckpointDir(my_directory_name)
It would be useful to specify that 'my_directory_name' should exist on all
slaves. Alternatively, you could use an HDFS directory URL instead.
I've seen people trip on that a few times.
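
For example (the hostname is made up):

// A directory that exists on every slave...
sc.setCheckpointDir("/mnt/spark-checkpoints")
// ...or, more robustly, an HDFS directory URL:
sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints")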

-kr, Gerard.



On Fri, May 9, 2014 at 11:54 PM, zhen z...@latrobe.edu.au wrote:

 Hi Everyone,

 I found it quite difficult to find good examples for Spark RDD API calls.
 So
 my student and I decided to go through the entire API and write examples
 for
 the vast majority of API calls (basically examples for anything that is
 remotely interesting). I think these examples may be useful to other people.
 Hence I have put them up on my web site. There is also a pdf version that
 you can download from the web site.

 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

 Please let me know if you find any errors in them. Or any better examples
 you would like me to add into it.

 Hope you find it useful.

 Zhen



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Caching in graphX

2014-05-13 Thread Franco Avi
Hi, I'm writing this post because I would like to know a caching approach for
iterative algorithms in GraphX. So far I have not been able to keep the
execution time of each iteration stable. Can you achieve this?
The code I used is this:

var g = ... // my graph
var prevG: Graph[VD, ED] = null

var i = 0
while (i < maxIter) {

  prevG = g

  g = g.foo()
  g = g.foo1()
  g = g.fooN()

  g.cache()
  g.vertices.count + g.edges.count

  prevG.edges.unpersist()
  prevG.vertices.unpersist()

  i += 1 // missing in the original post; without it the loop never terminates
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-in-graphX-tp5482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Flavio Pompermaier
Great work! Thanks!
On May 13, 2014 3:16 AM, zhen z...@latrobe.edu.au wrote:

 Hi Everyone,

 I found it quite difficult to find good examples for Spark RDD API calls.
 So
 my student and I decided to go through the entire API and write examples
 for
 the vast majority of API calls (basically examples for anything that is
 remotely interesting). I think these examples may be useful to other people.
 Hence I have put them up on my web site. There is also a pdf version that
 you can download from the web site.

 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

 Please let me know if you find any errors in them. Or any better examples
 you would like me to add into it.

 Hope you find it useful.

 Zhen



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
Scala's for-loop is not just looping; it's not a native loop at the bytecode
level. It creates a couple of objects at runtime and performs a truckload of
method calls on them. As a result, if you refer to variables outside the
for-loop, the whole for-loop object and any variable inside the loop have to
be serializable. Since the for-loop itself is serializable in Scala, I guess
you have something non-serializable inside the for-loop.

The while-loop in Scala is native, so you won't have this issue if you use a
while-loop.
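
A minimal sketch of the two variants (the names are illustrative, not from
Pedro's code):

// for-comprehension: desugars to Range.foreach with a closure object, so
// referencing outer variables can drag the enclosing scope into the task
// closure that Spark must serialize.
for (i <- 0 until n) {
  rdd.mapPartitions(iter => iter.map(_ + i)).count()
}

// while-loop: compiles to plain bytecode jumps; only the variables actually
// used, here i, end up captured by the task closure.
var i = 0
while (i < n) {
  rdd.mapPartitions(iter => iter.map(_ + i)).count()
  i += 1
}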


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, May 9, 2014 at 1:13 PM, pedro ski.rodrig...@gmail.com wrote:

 Right now I am not using any class variables (references to this). All my
 variables are created within the scope of the method I am running.

 I did more debugging and found this strange behavior.
 variables here
 for loop
 mapPartitions call
 use variables here
 end mapPartitions
 endfor

 This will result in a serialization bug, but this won't:

 variables here
 for loop
 create new references to variables here
 mapPartitions call
 use new reference variables here
 end mapPartitions
 endfor



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517p5528.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Turn BLAS on MacOSX

2014-05-13 Thread DB Tsai
Hi wxhsdp,

See https://github.com/scalanlp/breeze/issues/142 and
https://github.com/fommil/netlib-java/issues/60 for details.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, May 13, 2014 at 2:17 AM, wxhsdp wxh...@gmail.com wrote:

 Hi, Xiangrui

   I compiled OpenBLAS on an EC2 m1.large; when Breeze calls the native lib,
 an error occurs:

 INFO: successfully loaded
 /mnt2/wxhsdp/libopenblas/lib/libopenblas_nehalemp-r0.2.9.rc2.so
 [error] (run-main-0) java.lang.UnsatisfiedLinkError:

 com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Ljava/lang/String;Ljava/lang/String;IIID[DII[DIID[DII)V
 java.lang.UnsatisfiedLinkError:

 com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Ljava/lang/String;Ljava/lang/String;IIID[DII[DIID[DII)V
 at com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Native
 Method)
 at
 com.github.fommil.netlib.NativeSystemBLAS.dgemm(NativeSystemBLAS.java:100)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Turn-BLAS-on-MacOSX-tpp5648.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Reading from .bz2 files with Spark

2014-05-13 Thread Xiangrui Meng
Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to bzip2
format. -Xiangrui

On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote:
 Hi all,

 Is anyone reading and writing to .bz2 files stored in HDFS from Spark with
 success?


 I'm finding the following results on a recent commit (756c96 from 24hr ago)
 and CDH 4.4.0:

 Works: val r = sc.textFile("/user/aa/myfile.bz2").count
 Doesn't work: val r = sc.textFile("/user/aa/myfile.bz2").map((s: String) =>
 s + "| ").count

 Specifically, I'm getting an exception coming out of the bzip2 libraries
 (see below stacktraces), which is unusual because I'm able to read from that
 file without an issue using the same libraries via Pig.  It was originally
 created from Pig as well.

 Digging a little deeper I found this line in the .bz2 decompressor's javadoc
 for CBZip2InputStream:

 Instances of this class are not threadsafe. [source]


 My current working theory is that Spark has a much higher level of
 parallelism than Pig/Hadoop does and thus I get these wild IndexOutOfBounds
 exceptions much more frequently (as in can't finish a run over a little 2M
 row file) vs hardly at all in other libraries.

 The only other reference I could find to the issue was in presto-users, but
 the recommendation to leave .bz2 for .lzo doesn't help if I actually do want
 the higher compression levels of .bz2.


 Would love to hear if I have some kind of configuration issue or if there's
 a bug in .bz2 that's fixed in later versions of CDH, or generally any other
 thoughts on the issue.


 Thanks!
 Andrew



 Below are examples of some exceptions I'm getting:

 14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.ArrayIndexOutOfBoundsException
 java.lang.ArrayIndexOutOfBoundsException: 65535
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
 at
 org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
 at java.io.InputStream.read(InputStream.java:101)
 at
 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
 at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
 at
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
 at
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)




 java.lang.ArrayIndexOutOfBoundsException: 90
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:900)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
 at
 org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
 at java.io.InputStream.read(InputStream.java:101)
 at
 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
 at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
 at
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
 at
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
 at
 org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
 at
 org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$countPartition$1(RDD.scala:868)



 

Re: 1.0.0 Release Date?

2014-05-13 Thread Anurag Tangri
Hi All,
We are also waiting for this. Does anyone know of a tentative date for this
release?

We are on Spark 0.8.0 right now. Should we wait for Spark 1.0 or upgrade
to Spark 0.9.1?


Thanks,
Anurag Tangri



On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote:

 Can anyone comment on the anticipated date, or worst-case timeframe, for
 when Spark 1.0.0 will be released?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-0-0-Release-Date-tp5664.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Turn BLAS on MacOSX

2014-05-13 Thread Debasish Das
Hi,

How do I load native BLAS libraries on Mac ?

I am getting the following errors while running LR and SVM with SGD:

14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS

14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS

On CentOS it was fine, but on Mac I am getting these warnings.

Also, when it fails to load native BLAS, does it use Java code for the BLAS
operations?

Maybe after the addition of Breeze we should add these details to a page as
well, so that users are aware of them before they report any performance
results.

Thanks.

Deb
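
On the fallback question: when neither native implementation loads,
netlib-java falls back to its pure-Java F2J implementation, so results are
correct but slower. A minimal sketch for checking which implementation was
picked up:

import com.github.fommil.netlib.BLAS

// Prints e.g. com.github.fommil.netlib.F2jBLAS when the native loads above
// failed, or NativeSystemBLAS / NativeRefBLAS when one of them succeeded.
println(BLAS.getInstance().getClass.getName)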


Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Marcelo Vanzin
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 That API is something the HDFS administrator uses outside of any application 
 to tell HDFS to cache certain files or directories. But once you’ve done 
 that, any existing HDFS client accesses them directly from the cache.

Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use
that facility for adding files to the cache or anything like that. But
yes, it does benefit from things already cached.
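
To make the division of labor concrete, a hedged sketch (paths and the pool
name are made up):

// Admin side, done once outside Spark with the HDFS 2.3 CLI:
//   hdfs cacheadmin -addPool sparkPool
//   hdfs cacheadmin -addDirective -path /data/hot -pool sparkPool
// Spark side: an ordinary read goes through the standard HDFS client and
// transparently benefits from the centralized cache.
val lines = sc.textFile("hdfs:///data/hot")
println(lines.count())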


 On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote:

 Is that true? I believe that API Chanwit is talking about requires
 explicitly asking for files to be cached in HDFS.

 Spark automatically benefits from the kernel's page cache (i.e. if
 some block is in the kernel's page cache, it will be read more
 quickly). But the explicit HDFS cache is a different thing; Spark
 applications that want to use it would have to explicitly call the
 respective HDFS APIs.

 On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Yes, Spark goes through the standard HDFS client and will automatically 
 benefit from this.

 Matei

 On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
 sc.textFile() and other HDFS-related APIs?

 http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit




 --
 Marcelo




-- 
Marcelo


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Guanhua Yan
Thanks Xiangrui. After some debugging efforts, it turns out that the
problem results from a bug in my code. But it's good to know that a long
lineage could also lead to this problem. I will also try checkpointing to
see whether the performance can be improved.

Best regards,
- Guanhua

On 5/13/14 12:10 AM, Xiangrui Meng men...@gmail.com wrote:

You have a long lineage that causes the StackOverflow error. Try
rdd.checkpoint() and rdd.count() every 20~30 iterations.
checkpoint() can cut the lineage. -Xiangrui

On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote:
 Dear Sparkers:

 I am using Python spark of version 0.9.0 to implement some iterative
 algorithm. I got some errors shown at the end of this email. It seems
that
 it's due to the Java Stack Overflow error. The same error has been
 duplicated on a mac desktop and a linux workstation, both running the
same
 version of Spark.

 The same line of code works correctly after quite some iterations. At
the
 line of error, rdd__new.count() could be 0. (In some previous rounds,
this
 was also 0 without any problem).

 Any thoughts on this?

 Thank you very much,
 - Guanhua


 
 CODE: print "round", round, rdd__new.count()
 
   File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 542, in count
 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
 java.lang.StackOverflowError [duplicate 1]
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
 aborting job
   File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 533, in sum
 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state FAILED
 from TID 1774 because its task set is gone
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 499, in reduce
 vals = self.mapPartitions(func).collect()
   File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 463, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
   File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect.
 : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times
 (most recent failure: Exception failure: java.lang.StackOverflowError)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 ==
 The stack overflow error is shown as follows:
 ==

 14/05/12 16:20:28 ERROR Executor: Exception in task ID 1774
 java.lang.StackOverflowError
 at java.util.zip.Inflater.inflate(Inflater.java:259)
 at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152)
 at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116)
 at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
 at

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
Count causes the overall performance to drop drastically. In fact, beyond 50
or so files it starts to hang if I force materialization.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Tue, May 13, 2014 at 9:34 PM, Xiangrui Meng men...@gmail.com wrote:

 After checkpoint(), call count() directly to materialize it. -Xiangrui

 On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:
   We are running into the same issue. After 700 or so files the stack
   overflows; cache, persist, and checkpointing don't help.
   Basically, checkpointing only saves the RDD when it is materialized, and
   it only materializes at the end, so it runs out of stack first.
 
  Regards
  Mayur
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng men...@gmail.com
 wrote:
 
  You have a long lineage that causes the StackOverflow error. Try
   rdd.checkpoint() and rdd.count() every 20~30 iterations.
   checkpoint() can cut the lineage. -Xiangrui
 
  On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote:
   Dear Sparkers:
  
   I am using Python spark of version 0.9.0 to implement some iterative
   algorithm. I got some errors shown at the end of this email. It seems
   that
   it's due to the Java Stack Overflow error. The same error has been
   duplicated on a mac desktop and a linux workstation, both running the
   same
   version of Spark.
  
   The same line of code works correctly after quite some iterations. At
   the
   line of error, rdd__new.count() could be 0. (In some previous rounds,
   this
   was also 0 without any problem).
  
   Any thoughts on this?
  
   Thank you very much,
   - Guanhua
  
  
   
   CODE: print "round", round, rdd__new.count()
   
     File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 542, in count
   14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
   java.lang.StackOverflowError [duplicate 1]
   return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
   aborting job
     File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 533, in sum
   14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state
   FAILED from TID 1774 because its task set is gone
   return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
     File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 499, in reduce
   vals = self.mapPartitions(func).collect()
     File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 463, in collect
   bytesInJava = self._jrdd.collect().iterator()
     File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
     File "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect.
   : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times
   (most recent failure: Exception failure: java.lang.StackOverflowError)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at
 

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Chanwit Kaewkasi
Great to know that! Thank you, Matei.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 That API is something the HDFS administrator uses outside of any application 
 to tell HDFS to cache certain files or directories. But once you've done 
 that, any existing HDFS client accesses them directly from the cache.

 Matei

 On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote:

 Is that true? I believe that API Chanwit is talking about requires
 explicitly asking for files to be cached in HDFS.

 Spark automatically benefits from the kernel's page cache (i.e. if
 some block is in the kernel's page cache, it will be read more
 quickly). But the explicit HDFS cache is a different thing; Spark
 applications that want to use it would have to explicitly call the
 respective HDFS APIs.

 On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Yes, Spark goes through the standard HDFS client and will automatically 
 benefit from this.

 Matei

 On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
 sc.textFile() and other HDFS-related APIs?

 http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit




 --
 Marcelo