Re: how to save RDD partitions in different folders?

2014-04-07 Thread dmpour23
Can you provide an example?

hang on sorting operation

2014-04-07 Thread Stuart Zakon
I am seeing a small standalone cluster (master, slave) hang when I reach a certain memory threshold, but I cannot work out how to configure memory to avoid this. I added memory by setting SPARK_DAEMON_MEMORY=2G and I can see it allocated, but it does not help. The reduce is by key to get
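One thing worth noting: SPARK_DAEMON_MEMORY sizes the standalone master and worker daemon JVMs, not the executors that actually run the sort. A minimal sketch of setting executor memory instead, assuming a 0.9-style SparkConf and an illustrative 4g value:

import org.apache.spark.{SparkConf, SparkContext}

// SPARK_DAEMON_MEMORY only resizes the master/worker daemons; the
// executors doing the sort are sized by spark.executor.memory.
val conf = new SparkConf()
  .setAppName("SortJob")
  .set("spark.executor.memory", "4g") // illustrative value
val sc = new SparkContext(conf)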

Recommended way to develop spark application with both java and python

2014-04-07 Thread Wush Wu
Dear all, We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are familiar with Python, but some features are developed in Java. I am looking for a way to integrate Java and Python on Spark. I notice that the initialization of pyspark does not include a field to distribute

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi, Any thoughts on this? Thanks. -Suren On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations
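For reference, a minimal sketch of the persist-to-disk options being asked about, assuming an existing SparkContext sc and illustrative paths:

import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs:///input").map(line => (line, 1))
pairs.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk when memory runs short
// StorageLevel.DISK_ONLY keeps the persisted data entirely on disk
pairs.reduceByKey(_ + _).count()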

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark, Should I assume that Shark users should not use the Shark APIs, since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote: Hello everyone, I have

Require some clarity on partitioning

2014-04-07 Thread Sanjay Awatramani
Hi, I was going through Matei's Advanced Spark presentation at https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions. The slides for this video are at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf The PageRank example
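For context, the co-partitioning trick from that PageRank example boils down to hash-partitioning the links once and caching them, so each iteration's join reuses the partitioning instead of reshuffling. A sketch with illustrative names and an arbitrary partition count, assuming an existing SparkContext sc:

import org.apache.spark.HashPartitioner

// Partition the links once; ranks inherits the partitioner, so the
// join in each iteration avoids a full shuffle.
val links = sc.textFile("hdfs:///links")
  .map { line => val Array(src, dst) = line.split("\\s+"); (src, dst) }
  .groupByKey()
  .partitionBy(new HashPartitioner(8))
  .cache()

var ranks = links.mapValues(_ => 1.0)
for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (dsts, rank) => dsts.map(dst => (dst, rank / dsts.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}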

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi All, I wanted to get Spark on YARN up and running. I did *SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly* Then I ran *SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar

Re: How to create a RPM package

2014-04-07 Thread Will Benton
For issue #2, I was concerned that the build packaging had to be internal. So I am using the already-packaged make-distribution.sh (modified to use a Maven build) to create a tarball, which I then package using an RPM spec file. Hi Rahul, so the issue for downstream operating system

PySpark SocketConnect Issue in Cluster

2014-04-07 Thread Surendranauth Hiraman
Hi, We have a situation where a PySpark script works fine as a local process (local URL) on the master and the worker nodes, which would indicate that all Python dependencies are set up properly on each machine. But when we try to run the script at the cluster level (using the master's URL), if

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-) 1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transform's processing is complete? In the case of things like GroupBy, is the Seq backed by disk as it is being created? We're
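On the timing question, a small sketch showing that persist() is lazy: it only marks the RDD, and nothing is written to memory or disk until the first action forces the transformation to run (assuming an existing SparkContext sc):

import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs:///input").map(line => (line, 1))
val grouped = pairs.groupByKey().persist(StorageLevel.DISK_ONLY)
// Nothing has executed yet; persist() just records the storage level.
grouped.count()                    // first action: the groupBy runs and partitions are stored
grouped.mapValues(_.size).count()  // second action: reads the persisted partitions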

Re: reduceByKeyAndWindow Java

2014-04-07 Thread Eduardo Costa Alfaia
Hi TD, Could you explain this part of the code to me?

.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public
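For what it's worth, the two functions are the reduce and its inverse: the first folds in values entering the window, the second subtracts values leaving it, so the windowed counts are maintained incrementally. The same call as a Scala sketch, assuming a DStream[(String, Int)] named pairs and illustrative durations:

import org.apache.spark.streaming.Seconds

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // add counts as batches enter the window
  (a: Int, b: Int) => a - b, // subtract counts as batches leave the window
  Seconds(30),               // window length
  Seconds(10)                // slide interval
)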

AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the Spark-EC2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the Spark-EC2 script allows you to specify which user to

SparkContext.addFile() and FileNotFoundException

2014-04-07 Thread Thierry Herrmann
Hi, I'm trying to use SparkContext.addFile() to propagate a file to worker nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the master). I don't have HDFS or any distributed file system. Just playing with basic stuff. Here's the code in my driver (actually spark-shell
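A common cause of FileNotFoundException here is reading the original driver-side path on the workers instead of resolving the worker-local copy. A minimal sketch, assuming an existing SparkContext sc and a hypothetical /tmp/lookup.txt on the driver:

import org.apache.spark.SparkFiles

sc.addFile("/tmp/lookup.txt") // shipped to each worker
val sizes = sc.parallelize(1 to 4).map { i =>
  val path = SparkFiles.get("lookup.txt") // worker-local copy, not the driver path
  (i, scala.io.Source.fromFile(path).getLines().size)
}.collect()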

Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
That work is under submission at an academic conference and will be made available if/when the paper is published. In terms of algorithms for hyperparameter tuning, we consider Grid Search, Random Search, a couple of older derivative-free optimization methods, and a few newer methods - TPE (aka

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Yana Kadiyska
I might be wrong here but I don't believe it's discouraged. Maybe part of the reason there's not a lot of examples is that sql2rdd returns an RDD (TableRDD that is https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala). I haven't done anything too complicated yet but
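A rough sketch of what a sql2rdd call looks like, with the caveat that the initialization details vary across Shark versions and the table name is illustrative:

import shark.{SharkContext, SharkEnv}

// Setup varies by Shark version; sql2rdd returns a TableRDD as linked above.
val sc: SharkContext = SharkEnv.initWithSharkContext("shark-app")
val rdd = sc.sql2rdd("SELECT key, value FROM src")
println(rdd.count())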

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram, OK so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but this worked last Friday, and today (Monday) it no longer works. :D Why? ...It seems that NOW,

Driver Out of Memory

2014-04-07 Thread Eduardo Costa Alfaia
Hi Guys, I would like to understand why the driver's RAM goes down. Does the processing occur only in the workers? Thanks

# Start Tests
computer1 (Worker/Source Stream)
 23:57:18 up 12:03,  1 user,  load average: 0.03, 0.31, 0.44
              total       used       free     shared
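To the question itself: transformations run on the workers, but the driver JVM still holds whatever results actions pull back. A small sketch of the distinction, assuming an existing SparkContext sc and placeholder paths:

val big = sc.textFile("hdfs:///input")
big.saveAsTextFile("hdfs:///output") // materialized on the workers; driver RAM untouched
val sample = big.take(10)            // only ten lines come back to the driver
// big.collect()                     // would pull the whole dataset into the driver JVM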

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch the instances ? The typical workflow is to use the spark-ec2 wrapper script using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html Shivaram On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini

CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating version 0.9.0 without any Hadoop at all, and need some help. I run into the following error with the StatefulNetworkWordCount example (and similarly in my prototype app, when I use the updateStateByKey operation). I get
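For reference, updateStateByKey requires a checkpoint directory that every node can reach; a local path that exists on only one machine is a classic source of checkpoint partition mismatches. A sketch with a placeholder HDFS URI, assuming an existing StreamingContext ssc and a DStream[String] named words:

ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints") // placeholder URI; must be visible to all nodes

val counts = words.map((_, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newValues.sum)
}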

Re: Creating a SparkR standalone job

2014-04-07 Thread pawan kumar
Thanks Shivaram! Will give it a try and let you know. Regards, Pawan Venugopal On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: You can create standalone jobs in SparkR as just R files that are run using the sparkR script. These commands will be sent

job offering

2014-04-07 Thread Rault, Severan
Hi, I am looking for users of Spark to join my teams here at Amazon. If you are reading this you probably qualify. I am looking for developers of ANY level, but with an interest in Spark. My teams are leveraging Spark to solve real business scenarios. If you are interested, just shoot me a note

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
A few things would be helpful. 1. Environment settings - you can find them on the Environment tab in the Spark application UI 2. Are you setting the HDFS configuration correctly in your Spark program? For example, can you write an HDFS file from a Spark program (say spark-shell) to your HDFS
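Check #2 can be run in two lines from spark-shell; if this round trip fails against the cluster, the HDFS configuration is the problem (the namenode URI below is a placeholder):

sc.parallelize(1 to 100).saveAsTextFile("hdfs://namenode:8020/tmp/spark-hdfs-test")
println(sc.textFile("hdfs://namenode:8020/tmp/spark-hdfs-test").count()) // expect 100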

RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
Any reason why RDDInfo suddenly became private in SPARK-1132? We are using it to show users the status of RDDs.

Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
ok yeah we are using StageInfo and TaskInfo too... On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote: Hi Koert, Other users have expressed interest for us to expose similar classes too (i.e. StageInfo, TaskInfo). In the newest release, they will be available as part of
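For anyone following along, a hedged sketch of consuming StageInfo and TaskInfo through the SparkListener developer API referenced above, rather than the now-private RDDInfo (event and method names per the 1.0-era listener API):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

class StatusReporter extends SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} done: ${stage.stageInfo.name}")
  override def onTaskEnd(task: SparkListenerTaskEnd): Unit =
    println(s"task ${task.taskInfo.taskId} ran for ${task.taskInfo.duration} ms")
}
// sc.addSparkListener(new StatusReporter()) // attach to an existing SparkContext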

RE: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
1.: I will paste the full content of the environment page of the example application running against the cluster at the end of this message. 2. and 3.: Following #2 I was able to see that the count was incorrectly 0 when running against the cluster, and following #3 I was able to get the

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-07 Thread Francis . Hu
Great!!! When I built it on another disk formatted as ext4, it works now.

hadoop@ubuntu-1:~$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sdb6      ext4      135G  8.6G  119G   7% /
udev           devtmpfs  7.7G  4.0K  7.7G   1% /dev
tmpfs

Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.eduwrote: I am running the latest version of PySpark branch-0.9 and having some trouble with join. One RDD is about 100G (25GB compressed and serialized in memory) with 130K records, the other RDD is about 10G (2.5G

[BLOG] For Beginners

2014-04-07 Thread prabeesh k
Hi all, Here I am sharing a blog post for beginners about creating a standalone Spark Streaming application and bundling the app as a single runnable jar. Take a look and drop your comments on the blog page. http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html

Mongo-Hadoop Connector with Spark

2014-04-07 Thread Pavan Kumar
Hi Everyone, I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that GridFS collection data using Java Spark MapReduce. Previously I successfully processed normal MongoDB collections (not GridFS) with Apache Spark using the Mongo-Hadoop connector. Now I'm unable to handle input
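As a point of comparison, here is a hedged sketch of the path that already works: reading a plain (non-GridFS) collection through the Mongo-Hadoop connector, with a placeholder URI and an existing SparkContext sc. GridFS spreads a file across a chunks collection, so reading the 2GB PDF would mean pointing this at fs.chunks and reassembling the chunks yourself, which the stock input format does not do:

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

val config = new Configuration()
config.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection") // placeholder

val docs = sc.newAPIHadoopRDD(
  config,
  classOf[MongoInputFormat],
  classOf[Object],      // key: the document _id
  classOf[BSONObject])  // value: the BSON document
println(docs.count())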