Re: distinct in data frame in spark

2014-03-24 Thread Andrew Ash
My thought would be to key by the first item in each array, then take just one array for each key. Something like the below: v = sc.parallelize(Seq(Seq(1,2,3,4),Seq(1,5,2,3),Seq(2,3,4,5))) col = 0 output = v.keyBy(_(col)).reduceByKey((a, b) => a).values On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
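A minimal, self-contained version of that sketch (assuming a SparkContext named sc; the data and column index are illustrative):

  import org.apache.spark.SparkContext._           // pair-RDD functions on older Spark versions

  // Keep one row per distinct value in the chosen column; which duplicate survives is arbitrary.
  val v = sc.parallelize(Seq(Seq(1, 2, 3, 4), Seq(1, 5, 2, 3), Seq(2, 3, 4, 5)))
  val col = 0
  val deduped = v.keyBy(row => row(col))           // key each row by column `col`
                 .reduceByKey((a, b) => a)         // keep the first row seen per key
                 .values
  deduped.collect().foreach(row => println(row.mkString(",")))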

Spark Streaming ZeroMQ Java Example

2014-03-24 Thread goofy real
Is there a ZeroMQWordCount Java sample code? https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala

Re: coalescing RDD into equally sized partitions

2014-03-24 Thread Matei Zaharia
This happened because they were integers equal to 0 mod 5, and we used the default hashCode implementation for integers, which will map them all to 0. There’s no API method that will look at the resulting partition sizes and rebalance them, but you could use another hash function. Matei On Mar
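One hedged way to do that rebalancing by hand is a small custom Partitioner whose key function keeps these particular values from colliding (a sketch based on the multiples-of-20 example from the original question):

  import org.apache.spark.Partitioner

  // The default HashPartitioner sends 0, 20, 40, ... all to partition 0 (an Int's hashCode is the value itself).
  class SpreadPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = (key.asInstanceOf[Int] / 20) % parts
  }

  val data = sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)
  val balanced = data.map(x => (x, x)).partitionBy(new SpreadPartitioner(5)).values
  balanced.glom().collect()   // Array(Array(0), Array(20), Array(40), Array(60), Array(80))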

Re: spark executor/driver log files management

2014-03-24 Thread Sourav Chandra
Hi TD, I thought about that but was not sure whether this will have any impact on the Spark UI / Executor runner, as it redirects the stream to stderr/stdout. But ideally it should not, as it will fetch the log records from the stderr file (which is the latest).. Is my understanding correct? Thanks, Sourav On Tue

Re: Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, I dug into the source of spark a bit deeper, and I got some ideas. Firstly, immutability is a feature of rdd but not a solid rule; there are ways to change it, for example a rdd with a non-idempotent "compute" function, though it is really a bad design to make that function non-idempotent

Re: Re: RDD usage

2014-03-24 Thread hequn cheng
First question: If you save your modified RDD like this: points.foreach(p=>p.y = another_value).collect() or points.foreach(p=>p.y = another_value).saveAsTextFile(...) the modified RDD will be materialized and this will not use any worker's memory. If you have more transformations after the map(), the

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming wrote: > Thanks again. > >> If you use the KMeans implementation from MLlib, the >> initialization stage is done on master, > > The "master" here is the app/driver/spark-sh

Re: RDD usage

2014-03-24 Thread Mark Hamstra
No, it won't. The type of RDD#foreach is Unit, so it doesn't return an RDD. The utility of foreach is purely for the side effects it generates, not for its return value -- and modifying an RDD in place via foreach is generally not a very good idea. On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng
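A hedged sketch of the transformation-style alternative (the field names are simplified from the AppDataPoint class in the original question; points and another_value come from that thread):

  case class Point(y: Double, x: Array[Double])

  // Derive a new RDD instead of mutating records in place with foreach.
  val updated = points.map(p => p.copy(y = another_value))   // RDD[Point], original left untouched
  updated.cache()                                            // keep the modified copy if it is reused downstream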

Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, a related question: does that mean the memory usage will be doubled? And furthermore, if the compute function in a rdd is not idempotent, the rdd will change while the job is running, is that right? - Original message - From: "hequn cheng" Sent: 2014/3/25 9:35 To: "user@spark.apache.org" Subject: R

Re: RDD usage

2014-03-24 Thread hequn cheng
points.foreach(p=>p.y = another_value) will return a new modified RDD. 2014-03-24 18:13 GMT+08:00 Chieh-Yen : > Dear all, > > I have a question about the usage of RDD. > I implemented a class called AppDataPoint, it looks like: > > case class AppDataPoint(input_y : Double, input_x : Array[Double

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks again. > If you use the KMeans implementation from MLlib, the > initialization stage is done on master, The “master” here is the app/driver/spark-shell? Thanks! On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote: > Number of rows doesn't matter much as long as you have enough workers >

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
I didn’t group the integers, but process them in groups of two per partition, like this: scala> val a = sc.parallelize(List(1, 2, 3, 4), 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12 then process each partition and handle its elements in groups of 2 sc
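A compact sketch of that per-partition pairing (the pairwise sum is just an illustrative computation; assumes each partition holds an even number of elements):

  val a = sc.parallelize(List(1, 2, 3, 4), 2)

  // Walk each partition's iterator two elements at a time.
  val pairSums = a.mapPartitions(iter => iter.grouped(2).map(pair => pair.sum))
  pairSums.collect()   // Array(3, 7) for this input: (1,2) and (3,4)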

Re: Writing RDDs to HDFS

2014-03-24 Thread Yana Kadiyska
Ognen, can you comment if you were actually able to run two jobs concurrently with just restricting spark.cores.max? I run Shark on the same cluster and was not able to see a standalone job get in (since Shark is a "long running" job) until I restricted both spark.cores.max _and_ spark.executor.mem

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Yes, actually even for spark, I mostly use the sbt I installed… so I always miss this issue… If you can reproduce the problem with the spark-distributed sbt, I suggest proposing a PR to fix the document before 0.9.1 is officially released Best, -- Nan Zhu On Monday, March 24, 2014 at

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
Partition your input into an even number of partitions and use mapPartitions to operate on the Iterator[Int]; maybe there is a more efficient way… Best, -- Nan Zhu On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote: > Hi, I have large data set of numbers ie RDD and wanted to perform a > comput

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
It is suggested implicitly in giving you the command "./sbt/sbt". The separately installed sbt isn't in a folder called sbt, whereas Spark's version is. And more relevantly, just a few paragraphs earlier in the tutorial you execute the command "sbt/sbt assembly" which definitely refers to the spar

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Holden Karau
There is also https://github.com/apache/spark/pull/18 against the current repo which may be easier to apply. On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh wrote: > Hi Jaonary, > > You can find the code for k-fold CV in > https://github.com/apache/incubator-spark/pull/448. I have not find the >

coalescing RDD into equally sized partitions

2014-03-24 Thread Walrus theCat
Hi, sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0 ).coalesce(5,true).glom.collect yields Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array()) How do I get something more like: Array(Array(0), Array(20), Array(40), Array(60), Array(80)) Thank

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
We need someone who can explain it to us with a short code snippet on the given example, so that we get a clear-cut idea of RDD indexing. Guys please help us -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp31

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Walrus theCat
I'm also interested in this. On Mon, Mar 24, 2014 at 4:59 PM, yh18190 wrote: > Hi, I have large data set of numbers ie RDD and wanted to perform a > computation only on groupof two values at a time. For example > 1,2,3,4,5,6,7... is an RDD Can i group the RDD into (1,2),(3,4),(5,6)...?? > and p

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
I realized that I never read the document carefully, and I never found that the Spark documentation suggests using the Spark-distributed sbt…… Best, -- Nan Zhu On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote: > Thanks for your help, everyone. Several folks have explained that I can >

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
Diana, I think you are correct - I just installed wget http://mirror.symnds.com/software/Apache/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-cdh4.tgz and indeed I see the same error that you see It looks like in previous versions sbt-launch used to just come down in the pack

Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
Hi, I have a large data set of numbers, i.e. an RDD, and wanted to perform a computation only on groups of two values at a time. For example, 1,2,3,4,5,6,7... is an RDD. Can I group the RDD into (1,2),(3,4),(5,6)...? and perform the respective computations in an efficient manner? As we don't have a way to index e

Re: combining operations elegantly

2014-03-24 Thread Richard Siebeling
Hi guys, thanks for the information, I'll give it a try with Algebird, thanks again, Richard @Patrick, thanks for the release calendar On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell wrote: > Hey All, > > I think the old thread is here: > https://groups.google.com/forum/#!msg/spark-users/gVt

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Sweet! That's simple enough. Here's a JIRA ticket to track adding this to PySpark for the future: https://spark-project.atlassian.net/browse/SPARK-1308 Nick On Mon, Mar 24, 2014 at 4:29 PM, Patrick Wendell wrote: > Ah we should just add this directly in pyspark - it's as simple as the > code

Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Just so I can close this thread (in case anyone else runs into this stuff) - I did sleep through the basics of Spark ;). The answer on why my job is in waiting state (hanging) is here: http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling Ognen On 3/24/14, 5:

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Walrus theCat
If someone wanted / needed to implement this themselves, are partitions the correct way to go? Any tips on how to get started (say, dividing an RDD into 5 parts)? On Fri, Mar 21, 2014 at 9:51 AM, Jaonary Rabarisoa wrote: > Thank you Hai-Anh. Are the files CrossValidation.scala and > RandomS

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Niko Stahl
Hi Ankur, hi Deb, Thanks for the information and for the reference to the recent paper. I understand that GraphLab is highly optimized for graph algorithms and consistently outperforms GraphX for graph related tasks. I'd like to further evaluate the cost of moving data between Spark and some other

Re: question about partitions

2014-03-24 Thread Walrus theCat
Syed, Thanks for the tip. I'm not sure if coalesce is doing what I'm intending to do, which is, in effect, to subdivide the RDD into N parts (by calling coalesce and doing operations on the partitions.) It sounds like, however, this won't bottleneck my processing power. If this sets off any ala

Re: [bug?] streaming window unexpected behaviour

2014-03-24 Thread Tathagata Das
Yes, I believe that is current behavior. Essentially, the first few RDDs will be partial windows (assuming window duration > sliding interval). TD On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu wrote: > I have what I would call unexpected behaviour when using window on a stream. > > I have 2 wi

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Tathagata Das
Hello Sanjay, Yes, your understanding of lazy semantics is correct. But ideally every batch should read based on the batch interval provided in the StreamingContext. Can you open a JIRA on this? On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani wrote: > Hi All, > > I found out why this problem

Re: error loading large files in PySpark 0.9.0

2014-03-24 Thread Jeremy Freeman
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10, 100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls at the same point in each case. -- Jeremy - jeremy freeman, phd neuroscientist @thefreemanlab On Mar 23, 2014, at 9:56 P

Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Diana, thanks. I am not very well acquainted with HDFS. I use hdfs -put to put things as files into the filesystem (and sc.textFile to get stuff out of them in Spark) and I see that they appear to be saved as files that are replicated across 3 out of the 16 nodes in the hdfs cluster (which is m

Re: Writing RDDs to HDFS

2014-03-24 Thread Diana Carroll
Ongen: I don't know why your process is hanging, sorry. But I do know that the way saveAsTextFile works is that you give it a path to a directory, not a file. The "file" is saved in multiple parts, corresponding to the partitions. (part-0, part-1 etc.) (Presumably it does this because i

Re: spark executor/driver log files management

2014-03-24 Thread Tathagata Das
You can use RollingFileAppenders in log4j.properties. http://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html You can have other scripts delete old logs. TD On Mon, Mar 24, 2014 at 12:20 AM, Sourav Chandra < sourav.chan...@livestream.com> wrote: > Hi, >

Re: question about partitions

2014-03-24 Thread Syed A. Hashmi
RDD.coalesce should be fine for rebalancing data across all RDD partitions. Coalesce is pretty handy in situations where you have sparse data and want to compact it (e.g. data after applying a strict filter) OR when you know the magic number of partitions that will be optimal for your cluster.
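A small illustration of the compaction case (the numbers are made up):

  // After a strict filter most partitions are nearly empty; coalesce repacks them
  // without a full shuffle (pass shuffle = true if you want one).
  val raw      = sc.parallelize(1 to 1000000, 200)
  val filtered = raw.filter(_ % 1000 == 0)    // ~1000 survivors still spread over 200 partitions
  val compact  = filtered.coalesce(10)
  compact.partitions.size                     // 10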

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks for your help, everyone. Several folks have explained that I can surely solve the problem by installing sbt. But I'm trying to get the instructions working *as written on the Spark website*. The instructions not only don't have you install sbt separately...they actually specifically have

Re: question about partitions

2014-03-24 Thread Walrus theCat
For instance, I need to work with an RDD in terms of N parts. Will calling RDD.coalesce(N) possibly cause processing bottlenecks? On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote: > Hi, > > Quick question about partitions. If my RDD is partitioned into 5 > partitions, does that mean that I

Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Hmm. Strange. Even the below hangs. val r = sc.parallelize(List(1,2,3,4,5,6,7,8)) r.count I then looked at the web UI at port 8080 and realized that the spark shell is in WAITING status since another job is running on the standalone cluster. This may sound like a very stupid question but my e

Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt") supposed to work? Meaning, can I save files to the HDFS fs this way? I tried: val r = sc.parallelize(List(1,2,3,4,5,6,7,8)) r.saveAsTextFile("hdfs://ip:port/path/file.txt") and it is just hanging. At the same time on my HDFS i

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana, You don’t need to use the spark-distributed sbt; just download sbt from its official website and set your PATH to the right place. Best, -- Nan Zhu On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote: > Yeah, that's exactly what I did. Unfortunately it doesn't work: > >

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8 (since like other folks I had sbt preinstalled on my "usual" machine) I ran the command exactly as Ognen suggested and see Set current project to Simple Project (do you see this -- you should at least be seeing this) and then a bunch

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
Ah crud, I guess you are right, I am using the sbt I installed manually with my Scala installation. Well, here is what you can do: mkdir ~/bin cd ~/bin wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.1/sbt-launch.jar vi sbt Put the following contents into you

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Debasish Das
Hi Ankur, Given enough memory and proper caching, I don't understand why this is the case: GraphX may actually be slower when Spark is configured to launch many tasks per machine, because shuffle communication between Spark tasks on the same machine still occurs by reading and writing from disk,

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Soumya Simanta
@Diana - you can set up sbt manually for your project by following the instructions here: http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html Manual Installation: Manual installation requires downloa

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the code Shivaram just wrote. - Patrick On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman wrote: > There is no direct way to get this in pyspark, but you can get it from the > underlying java rdd. For example > > a = sc.para

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Yeah, that's exactly what I did. Unfortunately it doesn't work: $SPARK_HOME/sbt/sbt package awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for reading (No such file or directory) Attempting to fetch sbt /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or d

question about partitions

2014-03-24 Thread Walrus theCat
Hi, Quick question about partitions. If my RDD is partitioned into 5 partitions, does that mean that I am constraining it to exist on at most 5 machines? Thanks

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
You can use any sbt on your machine, including the one that comes with spark. For example, try: ~/path_to_spark/sbt/sbt compile ~/path_to_spark/sbt/sbt run Or you can just add that to your PATH by: export PATH=$PATH:~/path_to_spark/sbt To make it permanent, you can add it to your ~/.bashrc

Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman
There is no direct way to get this in pyspark, but you can get it from the underlying java rdd. For example a = sc.parallelize([1,2,3,4], 2) a._jrdd.splits().size() On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Mark, > > This appears to be a Scala-only

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Bharath Bhushan
Creating simple.sbt and src/ in $SPARK_HOME allows me to run a standalone scala program in the downloaded spark code tree. For example my directory layout is: $ ls spark-0.9.0-incubating-bin-hadoop2 … simple.sbt src … $ tree src src `-- main `-- scala `-- SimpleApp.scala -- Bharath On

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks Ognen. Unfortunately I'm not able to follow your instructions either. In particular: > > sbt compile > sbt run This doesn't work for me because there's no program on my path called "sbt". The instructions in the Quick Start guide are specific that I should call "$SPARK_HOME/sbt/sbt".

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks, Nan Zhu. You say that my problems are "because you are in Spark directory, don't need to do that actually , the dependency on Spark is resolved by sbt" I did try it initially in what I thought was a much more typical place, e.g. ~/mywork/sparktest1. But as I said in my email: (Just for

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
Diana, Anywhere on the filesystem you have read/write access (you need not be in your spark home directory): mkdir myproject cd myproject mkdir project mkdir target mkdir -p src/main/scala cp $mypath/$mymysource.scala src/main/scala/ cp $mypath/myproject.sbt . Make sure that myproject.sbt has

[bug?] streaming window unexpected behaviour

2014-03-24 Thread Adrian Mocanu
I have what I would call unexpected behaviour when using window on a stream. I have 2 windowed streams with a 5s batch interval. One window stream is (5s,5s)=smallWindow and the other (10s,5s)=bigWindow What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is of the size 5s
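For reference, a hedged sketch of that setup (the SparkConf and the socket source are illustrative stand-ins for whatever the real job uses):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf   = new SparkConf().setAppName("windows")
  val ssc    = new StreamingContext(conf, Seconds(5))       // 5s batch interval
  val stream = ssc.socketTextStream("localhost", 9999)

  val smallWindow = stream.window(Seconds(5), Seconds(5))   // (window, slide) = (5s, 5s)
  val bigWindow   = stream.window(Seconds(10), Seconds(5))  // (10s, 5s); its first RDDs cover partial windows

  bigWindow.count().print()
  ssc.start()
  ssc.awaitTermination()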

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Yana: Thanks. Can you give me a transcript of the actual commands you are running? Thanks! Diana On Mon, Mar 24, 2014 at 3:59 PM, Yana Kadiyska wrote: > I am able to run standalone apps. I think you are making one mistake > that throws you off from there onwards. You don't need to put your app

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Ankur Dave
Hi Niko, The GraphX team recently wrote a longer paper with more benchmarks and optimizations: http://arxiv.org/abs/1402.2394 Regarding the performance of GraphX vs. GraphLab, I believe GraphX currently outperforms GraphLab only in end-to-end benchmarks of pipelines involving both graph-parallel

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
I am able to run standalone apps. I think you are making one mistake that throws you off from there onwards. You don't need to put your app under SPARK_HOME. I would create it in its own folder somewhere; it follows the rules of any standalone scala program (including the layout). In the guide, $SP

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana, See my inlined answer -- Nan Zhu On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote: > Has anyone successfully followed the instructions on the Quick Start page of > the Spark home page to run a "standalone" Scala application? I can't, and I > figure I must be miss

quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Has anyone successfully followed the instructions on the Quick Start page of the Spark home page to run a "standalone" Scala application? I can't, and I figure I must be missing something obvious! I'm trying to follow the instructions here as close to "word for word" as possible: http://spark.apa

Re: Transitive dependency incompatibility

2014-03-24 Thread Jaka Jančar
Just found this issue and wanted to link it here, in case somebody finds this thread later: https://spark-project.atlassian.net/browse/SPARK-939 On Thursday, March 20, 2014 at 11:14 AM, Matei Zaharia wrote: > Hi Jaka, > > I’d recommend rebuilding Spark with a new version of the HTTPClient

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Debasish Das
Niko, Comparing some other components will be very useful as well... svd++ from graphx vs the same algorithm in graphlab... also mllib.recommendation.als implicit/explicit compared to the collaborative filtering toolkit in graphlab... To stress test, what's the biggest sparse dataset that you have

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Thanks Aaron !! On Mon, Mar 24, 2014 at 10:58 PM, Aaron Davidson wrote: > 1. Note sure on this, I don't believe we change the defaults from Java. > > 2. SPARK_JAVA_OPTS can be used to set the various Java properties (other > than memory heap size itself) > > 3. If you want to have 8 GB executor

Comparing GraphX and GraphLab

2014-03-24 Thread Niko Stahl
Hello, I'm interested in extending the comparison between GraphX and GraphLab presented in Xin et al. (2013). The evaluation presented there is rather limited as it only compares the frameworks for one algorithm (PageRank) on a cluster with a fixed number of nodes. Are there any graph algorithms w

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
1. Not sure on this, I don't believe we change the defaults from Java. 2. SPARK_JAVA_OPTS can be used to set the various Java properties (other than memory heap size itself) 3. If you want to have 8 GB executors then, yes, only two can run on each 16 GB node. (In fact, you should also keep a sig

Re: distinct on huge dataset

2014-03-24 Thread Aaron Davidson
Look up setting ulimit, though note the distinction between soft and hard limits, and that updating your hard limit may require changing /etc/security/limits.conf and restarting each worker. On Mon, Mar 24, 2014 at 1:39 AM, Kane wrote: > Got a bit further, i think out of memory error was caused

distinct in data frame in spark

2014-03-24 Thread Chengi Liu
Hi, I have a very simple use case: I have an rdd as following: d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]] Now, I want to remove all the duplicates from a column and return the remaining frame.. For example: If i want to remove the duplicate based on column 1. Then basically I would remove either row

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Number of rows doesn't matter much as long as you have enough workers to distribute the work. K-means has complexity O(n * d * k), where n is number of points, d is the dimension, and k is the number of clusters. If you use the KMeans implementation from MLlib, the initialization stage is done on m

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-24 Thread Debasish Das
After a long time (may be a month) I could do a fresh build for 2.0.0-mr1-cdh4.5.0...I was using the cached files in .ivy2/cache My case is especially painful since I have to build behind a firewall... @Sean thanks for the fix...I think we should put a test for https/firewall compilation as well.

remove duplicates

2014-03-24 Thread Adrian Mocanu
I have a DStream like this: ..RDD[a,b],RDD[b,c].. Is there a way to remove duplicates across the entire DStream? Ie: I would like the output to be (by removing one of the b's): ..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c].. Thanks -Adrian
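A hedged partial answer: transform with distinct() removes duplicates within each batch RDD, but a duplicate that spans two batches (the second b above) would still survive; catching those needs some stateful bookkeeping (e.g. updateStateByKey) on top of this:

  // Deduplicate within every batch; cross-batch duplicates are not handled here.
  val dedupedPerBatch = stream.transform(rdd => rdd.distinct())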

Re: pySpark memory usage

2014-03-24 Thread Matei Zaharia
Hey Jim, In Spark 0.9 we added a “batchSize” parameter to PySpark that makes it group multiple objects together before passing them between Java and Python, but this may be too high by default. Try passing batchSize=10 to your SparkContext constructor to lower it (the default is 1024). Or even

Re: Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Oh, I also forgot to mention: I start the master and workers (call ./sbin/start-all.sh), and then start the shell: MASTER=spark://localhost:7077 ./bin/spark-shell Then I get the exceptions... Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-

Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Hi, I'm running my program on a single large memory many core machine (64 cores, 1TB RAM). But to avoid having huge JVMs, I want to use several processes / worker instances - each using 8 cores (i.e. use SPARK_WORKER_INSTANCES). When I use 2 worker instances, everything works fine, but when I try

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Mark, This appears to be a Scala-only feature. :( Patrick, Are we planning to add this to PySpark? Nick On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra wrote: > It's much simpler: rdd.partitions.size > > > On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Sanjay Awatramani
Hi All, I found out why this problem exists. Consider the following scenario: - a DStream is created from any source. (I've checked with file and socket) - No actions are applied to this DStream - Sliding Window operation is applied to this DStream and an action is applied to the sliding window.

Re: mapPartitions use case

2014-03-24 Thread Nathan Kronenfeld
I've seen two cases most commonly: The first is when I need to create some processing object to process each record. If that object creation is expensive, creating one per record becomes prohibitive. So instead, we use mapPartitions, and create one per partition, and use it on each record in the
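A sketch of that first pattern (ExpensiveParser and lines are hypothetical placeholders):

  class ExpensiveParser {                    // imagine heavy, costly setup happening in the constructor
    def parse(line: String): Int = line.length
  }

  // One parser per partition instead of one per record.
  val parsed = lines.mapPartitions { iter =>
    val parser = new ExpensiveParser()
    iter.map(line => parser.parse(line))
  }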

Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Another thing I have noticed is that out of my master+15 slaves, two slaves always carry a higher inode load. So for example right now I am running an intensive job that takes about an hour to finish and two slaves have been showing an increase in inode consumption (they are about 10% above the

Akka error with largish job (works fine for smaller versions)

2014-03-24 Thread Nathan Kronenfeld
What does this error mean: @hadoop-s2.oculus.local:45186]: Error [Association failed with [akka.tcp://spark@hadoop-s2.oculus.local:45186]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-s2.oculus.local:45186] Caused by: akka.remote.transport.netty.Nett

mapPartitions use case

2014-03-24 Thread Jaonary Rabarisoa
Dear all, Sorry for asking such a basic question, but can someone explain when one should use mapPartitions instead of map? Thanks Jaonary

Re: java.io.NotSerializableException Of dependent Java lib.

2014-03-24 Thread yaoxin
We now have a method to work around this. For the classes that can't easily implement Serializable, we wrap the class in a Scala object. For example: class A {} // This class is not serializable object AHolder { private val a: A = new A() def get: A = a } This wo
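Restated as a hedged, self-contained sketch of that workaround (doWork and rdd are hypothetical, added only to show usage):

  class A { def doWork(x: Int): Int = x + 1 }   // not serializable

  object AHolder {
    private val a: A = new A()   // created when the object is first loaded on each executor JVM
    def get: A = a
  }

  // Closures refer to the AHolder object rather than capturing an A instance,
  // so no non-serializable state is shipped with the task.
  val out = rdd.map(x => AHolder.get.doWork(x))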

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Oh, glad to know it's that simple! Patrick, in your last comment did you mean filter in? As in I start with one year of data and filter it so I have one day left? I'm assuming in that case the empty partitions would be for all the days that got filtered out. Nick On Monday, March 24, 2014, Patrick Wendel

Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, the inode usage was about 50%. On two of the slaves, one had inode usage of 96% and on the other it was 100%. When i went into /tmp on these two nodes - there were a bunch of /tmp/spark* subdirectories which I deleted. This r

Re: Spark Java example using external Jars

2014-03-24 Thread dmpour23
Hello, Has anyone got any ideas? I am not quite sure if my problem is an exact fit for Spark, since in reality in this section of my program I am not really doing a reduce job, simply a group-by and partition. Would calling pipe on the partitioned JavaRDD do the trick? Are there any examples usin

RDD usage

2014-03-24 Thread Chieh-Yen
Dear all, I have a question about the usage of RDD. I implemented a class called AppDataPoint, it looks like: case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable { var y : Double = input_y var x : Array[Double] = input_x .. } Furthermore, I created th

Re: Spark worker threads waiting

2014-03-24 Thread sparrow
Yes my input data is partitioned in a completely random manner, so each worker that produces shuffle data processes only a part of it. The way I understand it is that before each stage each workers needs to distribute correct partitions (based on hash key ranges?) to other workers. And this is wher

Re: Java API - Serialization Issue

2014-03-24 Thread Sourav Chandra
I can suggest two things: 1. While creating the worker and submitting tasks, make sure you are not keeping any unwanted external class resource (which is not used in the closure and is not serializable). 2. If this is ensured and you still get some issue from a 3rd party library, you can make the 3rd party variable
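A hedged sketch of that second suggestion (ThirdPartyClient and its call method are hypothetical):

  class RecordProcessor extends Serializable {
    // Excluded from serialization; rebuilt lazily on each executor after deserialization.
    @transient lazy val client = new ThirdPartyClient()

    def process(record: String): String = client.call(record)
  }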

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Thanks Aaron and Sean... Setting SPARK_MEM finally worked. But I have a small doubt. 1) What is the default value that is allocated for the JVM and for HEAP_SPACE for the garbage collector? 2) Usually we set 1/3 of total memory for the heap. So what should be the practice for Spark processes? Where & how shoul

Re: java.io.NotSerializableException Of dependent Java lib.

2014-03-24 Thread santhoma
Can someone answer this question please? Specifically about the Serializable implementation of dependent jars .. ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-Of-dependent-Java-lib-tp1973p3087.html Sent from the Apache

Re: Java API - Serialization Issue

2014-03-24 Thread santhoma
I am also facing the same problem. I have implemented Serializable for my code, but the exception is thrown from third party libraries on which I have no control . Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: (li

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sean Owen
PS you have a typo in "DEAMON" - it's DAEMON. Thanks Latin. On Mar 24, 2014 7:25 AM, "Sai Prasanna" wrote: > Hi All !! I am getting the following error in interactive spark-shell > [0.8.1] > > > *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more > than 0 times; aborting job jav

Re: distinct on huge dataset

2014-03-24 Thread Kane
Got a bit further, i think out of memory error was caused by setting spark.spill to false. Now i have this error, is there an easy way to increase file limit for spark, cluster-wide?: java.io.FileNotFoundException: /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2 (

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
To be clear on what your configuration will do: - SPARK_DAEMON_MEMORY=8g will make your standalone master and worker schedulers have a lot of memory. These do not impact the actual amount of useful memory given to executors or your driver, however, so you probably don't need to set this. - SPARK_W

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks, Let me try with a smaller K. Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark? On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng wrote: > K = 50 is certainly a large number for k-means.

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
K = 50 is certainly a large number for k-means. If there is no particular reason to have 50 clusters, could you try to reduce it to, e.g., 100 or 1000? Also, the example code is not for large-scale problems. You should use the KMeans algorithm in mllib clustering for your problem. -Xiangrui
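A hedged sketch of calling the MLlib implementation (the input path and parameters are illustrative; the 0.9-era API took RDD[Array[Double]] rather than the Vector type shown here):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val data = sc.textFile("hdfs:///path/to/points")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  val model = KMeans.train(data, 50, 20)   // k = 50 clusters, 20 iterations
  model.clusterCenters.foreach(println)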

GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Hi All !! I am getting the following error in the interactive spark-shell [0.8.1] *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit exceeded* But I had set the following in spark-env.sh and hadoop-env.s

spark executor/driver log files management

2014-03-24 Thread Sourav Chandra
Hi, I have a few questions regarding log file management in spark: 1. Currently I did not find any way to modify the log file name for executors/drivers; it is hardcoded as stdout and stderr. Also there is no log rotation. In case of a streaming application this will grow forever and become unmanageab