Re: Problem building and publishing Spark 0.8.0 incubator - java command gets killed

2013-11-20 Thread Pankhuri Gupta
Thanks for the help. I will try deploying Spark on a larger instance and then get back. Best, Pankhuri On Nov 21, 2013, at 2:30 AM, Prashant Sharma wrote: > You mean t1.micro? The RAM is less than a GB (615 MB) on those instances. It > will not build. The size you are referring to is p

Re: Problem building and publishing Spark 0.8.0 incubator - java command gets killed

2013-11-20 Thread Prashant Sharma
You mean t1.micro? The RAM is less than a GB (615 MB) on those instances. It will not build. The size you are referring to is probably the storage size and not the RAM. It might not be worth trying out Spark on such instances. However, if you plan to upgrade, choose at least m1.large instances and then pr

Re: Problem building and publishing Spark 0.8.0 incubator - java command gets killed

2013-11-20 Thread Pankhuri Gupta
The instance type is "ti.micro" with a size of 7.9 GB, of which 4.3 GB is still available. For running Spark (and later Hadoop), should I use a Storage Optimized instance, or can it work on this one as well? On Nov 20, 2013, at 11:39 AM, Prashant Sharma wrote: > What is the instance type? Use a

time taken to fetch input partition by map

2013-11-20 Thread Umar Javed
Hi, The metrics provide information for the reduce (i.e. shuffleReader) tasks about the time taken to fetch the shuffle outputs. Is there a way I can find out the time taken by a map task (i.e. shuffleWriter) on a remote machine to read its input partition from disk? I believe I should look in

DFSBroadcastFactory

2013-11-20 Thread Dmitriy Lyubimov
I see that DFSBroadcastFactory has been gone since 0.6. Why? My problem is Spark clients behind NAT. For various reasons, traffic to HttpFactory cannot be forwarded, but broadcasting via HDFS would work for my purposes. Any suggestions? Thanks in advance. -Dmitriy

Re: How to unpersist JavaPairRDD

2013-11-20 Thread sasmita Patra
Thanks, Josh. On Wed, Nov 20, 2013 at 5:01 PM, Josh Rosen wrote: > JavaPairRDD should have had an unpersist() method; we'll fix this bug in > 0.8.1 (see https://github.com/apache/incubator-spark/pull/103). In the > meantime, just call myJavaPairRDD.rdd().unpersist() (see > https://mail-archives.a

Re: How to unpersist JavaPairRDD

2013-11-20 Thread Josh Rosen
JavaPairRDD should have had an unpersist() method; we'll fix this bug in 0.8.1 (see https://github.com/apache/incubator-spark/pull/103). In the meantime, just call myJavaPairRDD.rdd().unpersist() (see https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccaoepxp5syqn7v9gdomj

Trouble with MLbase exercise Documentation

2013-11-20 Thread sudhir vaidya
I am a beginner and have started to go through the MLbase exercises, but I get a java.lang.IndexOutOfBoundsException when I run the first command of step 2.1 here: http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html All I am doing is copying the command and pasting it to the

How to unpersist JavaPairRDD

2013-11-20 Thread sasmita Patra
Hi, I have two datasets that I load from HDFS. After loading the files, I cache the datasets. I need to join these datasets (LEFT/RIGHT, INNER/OUTER JOIN), apply some filter conditions, and then run multiple queries on the joined, filtered dataset. I have created
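
A minimal Scala sketch of this workflow (the poster is using the Java API, but the pattern is the same): load two pair RDDs, cache them, join and filter, run a couple of queries, and unpersist when done. The file paths, tab-separated format, and string keys are made up for illustration.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // brings join/leftOuterJoin into scope for pair RDDs

    object JoinAndUnpersistSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "join-unpersist-sketch")

        // Hypothetical inputs: tab-separated (key, value) records on HDFS.
        def loadPairs(path: String) =
          sc.textFile(path).map { line => val f = line.split("\t"); (f(0), f(1)) }

        val left  = loadPairs("hdfs:///data/left.tsv").cache()
        val right = loadPairs("hdfs:///data/right.tsv").cache()

        // Join, filter, and reuse the result across several queries.
        val joined = left.join(right)                          // inner join on the key
                         .filter { case (_, (l, r)) => l != r }
                         .cache()

        println("rows: " + joined.count())
        println("distinct keys: " + joined.keys.distinct().count())

        // Release the cached blocks once the queries are done
        // (from the Java API: myJavaPairRDD.rdd().unpersist(), as in the reply above).
        joined.unpersist()
        left.unpersist()
        right.unpersist()
        sc.stop()
      }
    }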

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Dmitriy Lyubimov
OK, but I would beware of possible leaks associated with heavy session recycling; at least, I did not stress it enough, even in standalone mode, to assess meaningful leakage. I only tried it for concurrency problems. On Wed, Nov 20, 2013 at 1:56 PM, Mingyu Kim wrote: > Yea, we want multiple contex

Re: Re: Spark Configuration with Python

2013-11-20 Thread Ewen Cheslack-Postava
You can use the SPARK_MEM environment variable instead of setting the system property. If you need to set other properties that can't be controlled by environment variables (which is why I wrote that patch), you can just apply that patch directly to your binary package -- it only patches a Python

Fwd: Re: Spark Configuration with Python

2013-11-20 Thread Michal Romaniuk
Patrick: It looks to me like this configures the cluster before startup. The setting that I want to change is the amount of memory available to each task (by default it's 512m). It appears that this is a property of the job itself rather than the cluster. Josh: I'm not sure about getting the lates

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Mingyu Kim
Yea, we want multiple contexts for isolation of imported jars. We'd like our users to submit jobs with their own versions of helper libraries (ones that they write) and updating jars at runtime can break a lot of systems. Okay, so we'd need to stick with the standalone mode if we were to use multi

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Dmitriy Lyubimov
Oh, I suppose if you mean you may install backend closures at will -- yes, this would present a problem, in the sense that since the session is already set up, one cannot update its backend closures. Alas, FWIW, Spark 0.8 doesn't officially make any claims as to the concurrency guarantees of multiple contexts

Re: Job cancellation

2013-11-20 Thread Mingyu Kim
Awesome! That's exactly what I needed. Is there any estimated timeline for the 0.8.1 release? Mingyu From: Mark Hamstra Reply-To: "user@spark.incubator.apache.org" Date: Wednesday, November 20, 2013 at 4:06 AM To: user Subject: Re: Job cancellation Job cancellation has been in both 0.8.1 SNA

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Dmitriy Lyubimov
On Wed, Nov 20, 2013 at 12:56 PM, Matt Cheah wrote: > Our use case is trying to isolate the classes in shipped jars that are > available to different users of the Spark context. > > We have multiple users that can run against our Web Server JVM which > will host a Spark Context. These users can

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Matt Cheah
Our use case is trying to isolate the classes in shipped jars that are available to different users of the Spark context. We have multiple users that can run against our Web Server JVM, which will host a Spark Context. These users can compile their own jars and query the SparkContext to add them

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Dmitriy Lyubimov
As far as I can tell, the Mesos backend would still not work correctly with multiple SparkContexts. However, if you are just after query concurrency, Spark 0.8 seems to support concurrent (reentrant) requests to the same session (SparkContext). One should also be able to use the FAIR sche
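
A sketch of that setup, assuming the spark.scheduler.mode property described in the job-scheduling docs and a single shared SparkContext queried from plain Java threads; the data, thread count, and pool name are made up.

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.SparkContext

    object FairSchedulerSketch {
      def main(args: Array[String]) {
        // Switch the in-application scheduler from FIFO to FAIR
        // (must be set before the SparkContext is created).
        System.setProperty("spark.scheduler.mode", "FAIR")
        val sc = new SparkContext("local[4]", "fair-scheduler-sketch")
        val data = sc.parallelize(1 to 1000000).cache()

        // Two "user requests" sharing one SparkContext, each submitted from its own thread.
        val pool = Executors.newFixedThreadPool(2)
        for (i <- 1 to 2) {
          pool.submit(new Runnable {
            def run() {
              // Optionally assign this thread's jobs to a named fair-scheduler pool:
              // sc.setLocalProperty("spark.scheduler.pool", "request-" + i)
              println("request " + i + ": sum = " + data.map(_.toLong).reduce(_ + _))
            }
          })
        }
        pool.shutdown()
        pool.awaitTermination(10, TimeUnit.MINUTES)
        sc.stop()
      }
    }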

Shark cached table eviction

2013-11-20 Thread Silvio Fiorito
Is there a way to programmatically evict a cached table from the cluster cache? Is it just a matter of dropping the table or is there something else available?

Re: Spark Configuration with Python

2013-11-20 Thread Josh Rosen
A recent pull request added a classmethod to PySpark's SparkContext that allows you to configure the Java system properties from Python: https://github.com/apache/incubator-spark/pull/97 On Wed, Nov 20, 2013 at 10:34 AM, Patrick Wendell wrote: > You can add java options in SPARK_JAVA_OPTS insid

Re: Spark Configuration with Python

2013-11-20 Thread Patrick Wendell
You can add Java options in SPARK_JAVA_OPTS inside conf/spark-env.sh: http://spark.incubator.apache.org/docs/latest/python-programming-guide.html#installing-and-configuring-pyspark - Patrick On Wed, Nov 20, 2013 at 8:52 AM, Michal Romaniuk wrote: > The info about configuration options is avai

Re: Joining files

2013-11-20 Thread Alex Boisvert
On Nov 20, 2013 8:34 AM, "Something Something" wrote: > > Questions: > > 1) I don't see APIs for LEFT, FULL OUTER Joins. True? > 2) Apache Pig provides different join types such as 'replicated', 'skewed'. Now 'replicated' may not be a concern in Spark 'cause everything happens in memory (possi

Re: Joining files

2013-11-20 Thread Alex Boisvert
On Nov 20, 2013 8:34 AM, "Something Something" wrote: > > Questions: > > 1) I don't see APIs for LEFT, FULL OUTER Joins. True? The join operations are documented here: http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions > 2) Apache Pig pr
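
For reference, a quick Scala sketch of the join variants that the linked PairRDDFunctions page documents, on tiny made-up inputs. As far as I can tell there is no built-in full outer join in this API version, but cogroup keeps the keys from both sides.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD operations

    val sc = new SparkContext("local", "join-types-sketch")
    val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
    val b = sc.parallelize(Seq(("k2", "x"), ("k3", "y")))

    a.join(b).collect()            // inner: only k2 -> (2, x)
    a.leftOuterJoin(b).collect()   // k1 -> (1, None), k2 -> (2, Some(x))
    a.rightOuterJoin(b).collect()  // k2 -> (Some(2), x), k3 -> (None, y)
    a.cogroup(b).collect()         // all keys, with a (possibly empty) Seq of values from each side
    sc.stop()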

Spark Configuration with Python

2013-11-20 Thread Michal Romaniuk
The info about configuration options is available at the link below, but this seems to only work with Java. How can those options be set from Python? http://spark.incubator.apache.org/docs/latest/configuration.html#system-properties Thanks, Michal
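
For comparison, this is what the "works with Java" side of those options looks like: they are ordinary JVM system properties read when the SparkContext is created, which is also what the PySpark routes discussed in the replies above ultimately have to set. A minimal Scala sketch with illustrative property values:

    import org.apache.spark.SparkContext

    object ConfigViaSystemPropertiesSketch {
      def main(args: Array[String]) {
        // System properties from the configuration page must be set
        // before the SparkContext is constructed.
        System.setProperty("spark.executor.memory", "1g")      // illustrative value
        System.setProperty("spark.default.parallelism", "16")  // illustrative value

        val sc = new SparkContext("local[2]", "config-sketch")
        println(sc.parallelize(1 to 100).count())
        sc.stop()
      }
    }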

Re: Problem building and publishing Spark 0.8.0 incubator - java command gets killed

2013-11-20 Thread Prashant Sharma
What is the instance type? Use an instance with at least 4 GB of RAM. I don't think it is possible to build on less than that. The other option would be to use a prebuilt binary. On Wed, Nov 20, 2013 at 8:56 PM, Pankhuri Gupta wrote: > Hi, > I am new to Spark and Scala. As part of one of my projects

Re: Joining files

2013-11-20 Thread Something Something
Questions: 1) I don't see APIs for LEFT, FULL OUTER Joins. True? 2) Apache Pig provides different join types such as 'replicated', 'skewed'. Now 'replicated' may not be a concern in Spark 'cause everything happens in memory (possibly). 3) Does the 'join' (which seems to work like INNER Join)

Problem building and publishing Spark 0.8.0 incubator - java command gets killed

2013-11-20 Thread Pankhuri Gupta
Hi, I am new to Spark and Scala. As part of one of my projects, I am trying to build and locally publish spark-0.8.0-incubating on an Amazon EC2 cluster. After setting up all the Java classpaths and options, when I run: ** sbt/sbt compile, OR ** sbt/

running transformation on group of RDDs concurrently

2013-11-20 Thread Yadid Ayzenberg
Assuming I also want to run n concurrent jobs of the following type: each RDD is of the same form (JavaPairRDD), and I would like to run the same transformation on all RDDs. The brute force way would be to instantiate n threads and submit a job from each thread. Would this way be valid as w
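
A sketch of the thread-per-job approach against one shared SparkContext (the poster's RDDs are JavaPairRDDs; the Scala equivalent is shown here, with made-up keys and counts):

    import java.util.concurrent.{Callable, Executors}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // reduceByKey on pair RDDs

    object ConcurrentRddJobsSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "concurrent-rdd-jobs-sketch")

        // n pair RDDs of the same shape (contents are made up).
        val rdds = (1 to 4).map(i => sc.parallelize(Seq(("k" + i, i), ("k" + i, i * 10))))

        // Same transformation on every RDD; each action runs in its own thread,
        // so the jobs execute concurrently inside the single SparkContext.
        val pool = Executors.newFixedThreadPool(4)
        val futures = rdds.map { rdd =>
          pool.submit(new Callable[Array[(String, Int)]] {
            def call() = rdd.reduceByKey(_ + _).collect()
          })
        }
        futures.foreach(f => println(f.get().toSeq))
        pool.shutdown()
        sc.stop()
      }
    }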

Re: Job cancellation

2013-11-20 Thread Mark Hamstra
Job cancellation has been in both 0.8.1-SNAPSHOT and 0.9.0-SNAPSHOT for a while now: PR29 and PR74. Modification/improvement of job cancellation is part of the open pull request PR190
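
A hedged Scala sketch of the job-group cancellation API as it eventually shipped (method names may differ slightly in the 0.8.1 snapshot): jobs are tagged with a group id from the submitting thread and cancelled by group id from another thread; the group id and the deliberately slow job are made up.

    import org.apache.spark.SparkContext

    object JobCancellationSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "cancel-sketch")
        val data = sc.parallelize(1 to 1000000)

        // Tag jobs started from this thread with a group id...
        val worker = new Thread {
          override def run() {
            sc.setJobGroup("report-1", "deliberately slow report")
            try {
              data.map { i => Thread.sleep(1); i }.count()
            } catch {
              case e: Exception => println("job ended: " + e.getMessage)
            }
          }
        }
        worker.start()

        Thread.sleep(2000)
        // ...then cancel everything in that group from another thread.
        sc.cancelJobGroup("report-1")
        worker.join()
        sc.stop()
      }
    }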

Job cancellation

2013-11-20 Thread Mingyu Kim
Hi all, Cancellation seems to be supported at the application level. In other words, you can call stop() on your instance of SparkContext in order to stop the computation associated with that SparkContext. Is there any way to cancel a job? (To be clear, a job is "a parallel computation consisting of mult

Multiple SparkContexts in one JVM

2013-11-20 Thread Mingyu Kim
Hi all, I've been searching to find out the current status of the multiple SparkContext support in one JVM. I found https://groups.google.com/forum/#!topic/spark-developers/GLx8yunSj0A and https://groups.google.com/forum/#!topic/spark-users/cOYP96I668I. According to the threads, I should be able t

Re: is this possible in Spark? ( Serialization related)

2013-11-20 Thread Eugen Cepoi
You can try broadcasting it. To avoid the not-serializable problem I am using Kryo; you can try the same. Eugen 2013/11/20 Pranay Tonpay > Jason.. I tried this, using Java code and it didn't work still…. > > > > Is there any workaround for this problem ? > > > > Thx > > pranay > > > > *From:*
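
A Scala sketch of that suggestion, assuming the standard Kryo hooks from the tuning guide; the Lookup class, registrator name, and property values are made up, and the registrator has to be reachable on the classpath by its fully-qualified name.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkContext
    import org.apache.spark.serializer.KryoRegistrator

    // Stand-in for the object that is not java.io.Serializable.
    class Lookup(val table: Map[String, Int])

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Lookup])
      }
    }

    object BroadcastWithKryoSketch {
      def main(args: Array[String]) {
        // Switch data serialization to Kryo before the context is created.
        System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        System.setProperty("spark.kryo.registrator", "MyRegistrator")

        val sc = new SparkContext("local[2]", "broadcast-kryo-sketch")
        val lookup = sc.broadcast(new Lookup(Map("a" -> 1, "b" -> 2)))

        val total = sc.parallelize(Seq("a", "b", "a"))
                      .map(k => lookup.value.table.getOrElse(k, 0))
                      .reduce(_ + _)
        println(total)   // 4
        sc.stop()
      }
    }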

RE: is this possible in Spark? ( Serialization related)

2013-11-20 Thread Pranay Tonpay
Jason.. I tried this, using Java code, and it didn't work still… Is there any workaround for this problem? Thx, pranay From: Jason Lenderman [mailto:jslender...@gmail.com] Sent: Wednesday, November 13, 2013 12:14 PM To: user@spark.incubator.apache.org Subject: Re: is this possible in Spark? (

Re: Reuse the Buffer Array in the map function?

2013-11-20 Thread Wenlei Xie
Thank you! Best, Wenlei On Tue, Nov 19, 2013 at 7:20 AM, Mark Hamstra wrote: > mapWith can make this use case even simpler. > > > > On Nov 19, 2013, at 1:29 AM, Sebastian Schelter > wrote: > > You can use mapPartitions, which allows you to apply the map function > elementwise to all elements of
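
A Scala sketch of the mapPartitions suggestion: allocate the scratch buffer once per partition instead of once per element, and emit a fresh value rather than the shared buffer itself. The vector size and the squared-sum computation are made up.

    import org.apache.spark.SparkContext

    object ReuseBufferSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "reuse-buffer-sketch")
        val points = sc.parallelize(1 to 1000).map(i => Array.fill(8)(i.toDouble))

        val sums = points.mapPartitions { iter =>
          val buf = new Array[Double](8)   // reused by every element in this partition
          iter.map { p =>
            var i = 0
            while (i < p.length) { buf(i) = p(i) * p(i); i += 1 }
            buf.sum                        // return a fresh value, not the shared buffer
          }
        }
        println(sums.reduce(_ + _))
        sc.stop()
      }
    }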