Block

2014-03-11 Thread David Thomas
What is the concept of Block and BlockManager in Spark? How is a Block related to a Partition of an RDD?

pyspark broadcast error

2014-03-11 Thread Brad Miller
Hi All, When I run the program shown below, I receive the error shown below. I am running the current version of branch-0.9 from github. Note that I do not receive the error when I replace 2 ** 29 with 2 ** X, where X < 29. More interestingly, I do not receive the error when X = 30, and when X

building spark over proxy

2014-03-11 Thread hades dark
Can someone help me on how to build spark over proxy settings .. -- REGARDS ASHUTOSH JAIN IIT-BHU VARANASI

Re: building spark over proxy

2014-03-11 Thread Bharath Vissapragada
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E On Tue, Mar 11, 2014 at 11:44 AM, hades dark hades.o...@gmail.com wrote: Can someone help me on how to build spark over proxy settings .. -- REGARDS

Reading sequencefile

2014-03-11 Thread Jaonary Rabarisoa
Hi all, I'm trying to read a sequenceFile that represents a set of jpeg images generated using this tool: http://stuartsierra.com/2008/04/24/a-million-little-files . According to the documentation: Each key is the name of a file (a Hadoop “Text”), the value is the binary contents of the file (a
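A minimal spark-shell sketch of reading a SequenceFile of this shape, assuming Text keys and BytesWritable values; the path is a placeholder and sc is the shell's predefined context:

```scala
import org.apache.hadoop.io.{BytesWritable, Text}

// Keys are file names (Text), values are the raw bytes of each file (BytesWritable).
val images = sc.sequenceFile("hdfs:///path/to/images.seq", classOf[Text], classOf[BytesWritable])
  .map { case (name, bytes) =>
    // Hadoop reuses Writable objects, so copy the contents out before keeping them around.
    (name.toString, bytes.getBytes.take(bytes.getLength))
  }

images.first()._1   // name of one of the archived files
```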

Spark stand alone cluster mode

2014-03-11 Thread Gino Mathews
Hi, I am new to spark. I would like to run jobs in Spark stand alone cluster mode. No cluster manager other than Spark is used. (https://spark.apache.org/docs/0.9.0/spark-standalone.html) I have tried wordcount from spark shell and a stand alone scala app. The code reads input from HDFS and
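A minimal sketch of a standalone-mode word count submitted against a standalone master; the master URL, jar path, and HDFS locations are placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair RDD functions such as reduceByKey

object WordCount {
  def main(args: Array[String]) {
    // master URL of the standalone cluster, app name, Spark home, and the job's jar
    val sc = new SparkContext("spark://master-host:7077", "WordCount",
      System.getenv("SPARK_HOME"), Seq("target/wordcount.jar"))

    val counts = sc.textFile("hdfs://namenode:9000/input/text")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://namenode:9000/output/counts")
    sc.stop()
  }
}
```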

Re: Spark stand alone cluster mode

2014-03-11 Thread Yana Kadiyska
does sbt show full-classpath show spark-core on the classpath? I am still pretty new to scala but it seems like you have val sparkCore = "org.apache.spark" %% "spark-core" % V.spark % "provided" -- I believe the provided part means it's in your classpath. Spark-shell script sets up

Pyspark Memory Woes

2014-03-11 Thread Aaron Olson
Dear Sparkians, We are working on a system to do relational modeling on top of Spark, all done in pyspark. While we've been learning a lot about Spark internals so far, we're currently running into memory issues and wondering how best to profile to fix them. Here are our symptoms: - We're

Spark usage patterns and questions

2014-03-11 Thread Sourav Chandra
Hi, I have some questions regarding usage patterns and debugging in spark/spark streaming. 1. What are some common design patterns for using broadcast variables? In my application I created some, and also created a scheduled task which periodically refreshes the variables. I want to know how
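One possible shape for the refresh pattern described above, sketched under the assumption that a lookup table is re-broadcast on a fixed schedule; loadLookupTable and the interval are illustrative:

```scala
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

class RefreshingBroadcast(sc: SparkContext) {

  // Placeholder for whatever re-reads the reference data (a database, a file, ...).
  private def loadLookupTable(): Map[String, String] = Map("key" -> "value")

  @volatile private var current: Broadcast[Map[String, String]] =
    sc.broadcast(loadLookupTable())

  // Re-broadcast on a fixed schedule; jobs submitted after a refresh pick up the new value
  // by calling get at submission time and closing over the returned Broadcast.
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    def run() { current = sc.broadcast(loadLookupTable()) }
  }, 10, 10, TimeUnit.MINUTES)

  def get: Broadcast[Map[String, String]] = current
}
```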

Re: NO SUCH METHOD EXCEPTION

2014-03-11 Thread Matei Zaharia
Since it’s from Scala, it might mean you’re running with a different version of Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while Spark 0.9 uses Scala 2.10. Matei On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia) arockia.r.jeya...@verizon.com wrote: Hi,
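A minimal build.sbt sketch of keeping the Scala version in line with the Spark build; the version strings are illustrative:

```scala
// The Scala version must match the one the Spark deployment was built against:
// 2.10.x for Spark 0.9, 2.9.x for Spark 0.8 and earlier.
scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"
```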

Re: Powered By Spark Page -- Companies Organizations

2014-03-11 Thread Matei Zaharia
Thanks, added you. On Mar 11, 2014, at 2:47 AM, Christoph Böhm listenbru...@gmx.net wrote: Dear Spark team, thanks for the great work and congrats on becoming an Apache top-level project! You could add us to your Powered-by-page, because we are using Spark (and Shark) to perform

Re: Pyspark Memory Woes

2014-03-11 Thread Sandy Ryza
Hi Aaron, When you say "Java heap space is 1.5G per worker, 24 or 32 cores across 46 nodes. It seems like we should have more than enough to do this comfortably.", how are you configuring this? -Sandy On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson aaron.ol...@shopify.com wrote: Dear Sparkians,

is spark.cleaner.ttl safe?

2014-03-11 Thread Michael Allman
Hello, I've been trying to run an iterative spark job that spills 1+ GB to disk per iteration on a system with limited disk space. I believe there's enough space if spark would clean up unused data from previous iterations, but as it stands the number of iterations I can run is limited by
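A sketch of turning the cleaner on, assuming Spark 0.9's SparkConf; the TTL value is illustrative. Note that anything older than the TTL, including persisted RDD blocks the job still needs, becomes eligible for cleanup, which is the crux of the safety question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable the periodic metadata cleaner: metadata, shuffle data, and cached blocks older
// than the TTL (in seconds) are dropped, so the TTL must exceed any reuse window.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("iterative-job")
  .set("spark.cleaner.ttl", "3600")

val sc = new SparkContext(conf)
```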

RE: unsubscribe

2014-03-11 Thread Kapil Malik
Ohh ! I thought you're unsubscribing :) Kapil Malik | kma...@adobe.com | 33430 / 8800836581 -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: 12 March 2014 00:51 To: user@spark.apache.org Subject: Re: unsubscribe To unsubscribe from this list, please

Re: Pyspark Memory Woes

2014-03-11 Thread Sandy Ryza
Are you aware that you get an executor (and the 1.5GB) per machine, not per core? On Tue, Mar 11, 2014 at 12:52 PM, Aaron Olson aaron.ol...@shopify.com wrote: Hi Sandy, We're configuring that with the JAVA_OPTS environment variable in $SPARK_HOME/spark-worker-env.sh like this: # JAVA OPTS

Re: Out of memory on large RDDs

2014-03-11 Thread Grega Kespret
"Your input data read as RDD may be causing OOM, so that's where you can use different memory configuration." We are not getting any OOM exceptions, just akka future timeouts in mapoutputtracker and unsuccessful get of shuffle outputs, therefore refetching them. What is the industry practice
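The right fix depends on the job, but a spark-shell sketch of the memory-related knobs usually meant by "different memory configuration"; the path and fraction are illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// 1) Persist the input serialized, spilling to disk rather than recomputing or failing.
val input = sc.textFile("hdfs:///large/input").persist(StorageLevel.MEMORY_AND_DISK_SER)

// 2) Lower the fraction of executor heap reserved for the RDD cache, set on the SparkConf
//    (or as a system property) before the context is created, e.g.:
//    spark.storage.memoryFraction = 0.4
```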

possible bug in Spark's ALS implementation...

2014-03-11 Thread Michael Allman
Hi, I'm implementing a recommender based on the algorithm described in http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the basis for Spark's ALS implementation for data sets with implicit features. The data set I'm working with is proprietary and I cannot share it,
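For context, a minimal spark-shell sketch of MLlib's implicit-feedback ALS (the formulation from the paper above); the input path, field layout, and parameter values are illustrative:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Each line: user,item,count — the "rating" field carries the implicit confidence signal.
val ratings = sc.textFile("hdfs:///data/implicit-feedback.csv").map { line =>
  val Array(user, item, count) = line.split(',')
  Rating(user.toInt, item.toInt, count.toDouble)
}

// rank = 10, iterations = 10, lambda = 0.01, alpha = 40.0
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 40.0)
```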

Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Xiangrui Meng
Hi Michael, I can help check the current implementation. Would you please go to https://spark-project.atlassian.net/browse/SPARK and create a ticket about this issue with component MLlib? Thanks! Best, Xiangrui On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman m...@allman.ms wrote: Hi, I'm

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread wallacemann
Ah! Thank you. That'll work for now.

Re: Applications for Spark on HDFS

2014-03-11 Thread Sandy Ryza
Hi Paul, What do you mean by distributing the jars manually? If you register jars that are local to the client with SparkContext.addJars, Spark should handle distributing them to the workers. Are you taking advantage of this? -Sandy On Tue, Mar 11, 2014 at 3:09 PM, Paul Schooss

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread wallacemann
In a similar vein, it would be helpful to have an Iterable way to access the data inside an RDD. The collect method takes everything in the RDD and puts it in a list, but this blows up memory. Since everything I want is already inside the RDD, it could be easy to iterate over the content without

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread Mark Hamstra
https://github.com/apache/incubator-spark/pull/421 Works pretty good, but really needs to be enhanced to work with AsyncRDDActions. On Tue, Mar 11, 2014 at 4:50 PM, wallacemann wall...@bandpage.com wrote: In a similar vein, it would be helpful to have an Iterable way to access the data
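A sketch assuming the pull request above corresponds to something like RDD.toLocalIterator, which streams one partition at a time to the driver rather than building the single array that collect() returns:

```scala
val rdd = sc.parallelize(1 to 1000000, 100)

// Runs a job per partition as the iterator advances, so driver memory only has to hold
// one partition's worth of data at a time.
val it = rdd.toLocalIterator
it.take(5).foreach(println)
```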

Re: RDD.saveAs...

2014-03-11 Thread Matei Zaharia
I agree that we can’t keep adding these to the core API, partly because it will get unwieldy to maintain and partly just because each storage system will bring in lots of dependencies. We can simply have helper classes in different modules for each storage system. There’s some discussion on
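A hypothetical illustration of that design: a storage-specific save helper kept in its own module and attached to RDD through an implicit wrapper, so the core API stays small. "SomeStore" and every name on it are made up:

```scala
import org.apache.spark.rdd.RDD

// Stand-in for a real storage client; a real module would wrap an actual driver library.
class SomeStoreClient(endpoint: String) {
  def write(record: String): Unit = println(s"$endpoint <- $record")
  def close(): Unit = ()
}

class SomeStoreFunctions(rdd: RDD[String]) {
  def saveToSomeStore(endpoint: String): Unit =
    rdd.foreachPartition { records =>
      val client = new SomeStoreClient(endpoint)   // one client per partition, on the executor
      records.foreach(client.write)
      client.close()
    }
}

object SomeStoreImplicits {
  implicit def toSomeStoreFunctions(rdd: RDD[String]): SomeStoreFunctions =
    new SomeStoreFunctions(rdd)
}

// usage: import SomeStoreImplicits._ ; myRdd.saveToSomeStore("host:1234")
```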

Re: Block

2014-03-11 Thread dachuan
In my opinion, BlockManager manages many types of Block; an RDD's partition, a.k.a. RDDBlock, is one of them. Other types of Blocks are ShuffleBlock, IndirectBlock (if the task's return status is too large), etc. So, BlockManager is a layer that is independent of the RDD concept. On Mar 11, 2014
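A small spark-shell sketch of that relationship: once a partition of a persisted RDD is computed, the BlockManager stores it under a block id of the form rdd_<rddId>_<partitionIndex>:

```scala
// Four partitions persisted => four RDD blocks managed by the BlockManager,
// e.g. rdd_<id>_0 through rdd_<id>_3 (visible in the storage tab of the web UI).
val cached = sc.parallelize(1 to 100, 4).cache()
cached.count()
```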

Re: Are all transformations lazy?

2014-03-11 Thread Ewen Cheslack-Postava
You should probably be asking the opposite question: why do you think it *should* be applied immediately? Since the driver program hasn't requested any data back (distinct generates a new RDD, it doesn't return any data), there's no need to actually compute anything yet. As the documentation
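A spark-shell sketch of the point: the transformation returns a new RDD immediately, and work happens only when an action runs:

```scala
val words = sc.parallelize(Seq("a", "b", "a", "c"))
val unique = words.distinct()   // returns immediately; no job is launched
val n = unique.count()          // the action: this is when the distinct work actually happens
```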

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
I think you misunderstood my question - I should have stated it better. I'm not saying it should be applied immediately, but I'm trying to understand how Spark achieves this lazy computation of transformations. Maybe this is due to my ignorance of how Scala works, but when I see the code, I see that
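A deliberately simplified, non-Spark sketch of how that laziness can be achieved in Scala: a transformation only records its parent and the function to apply, and nothing runs until an action calls compute():

```scala
abstract class MiniRDD[T] {
  def compute(): Iterator[T]

  def map[U](f: T => U): MiniRDD[U] = {
    val parent = this
    new MiniRDD[U] { def compute() = parent.compute().map(f) }   // nothing evaluated here
  }

  def collect(): List[T] = compute().toList                      // the "action" forces the work
}

object LazyDemo extends App {
  val base = new MiniRDD[Int] { def compute() = (1 to 5).iterator }
  val doubled = base.map(_ * 2)   // no computation has happened yet
  println(doubled.collect())      // List(2, 4, 6, 8, 10) is produced only here
}
```

Spark's real RDD class is considerably more involved (partitions, dependencies, schedulers), but the core trick of deferring evaluation until an action is the same.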