GraphX. How to remove vertex or edge?

2014-04-30 Thread Николай Кинаш
Hello. How do I remove a vertex or an edge from a graph in GraphX?
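
One common answer, offered here as a hedged sketch rather than anything from the original message (the String attribute types are placeholders): GraphX graphs are immutable, so "removing" elements means building a new graph with subgraph(), keeping only the vertices and edges whose predicates hold.

    import org.apache.spark.graphx.{Graph, VertexId}

    // Sketch only: drop one vertex (subgraph also drops the edges touching it)
    def removeVertex(g: Graph[String, String], badId: VertexId): Graph[String, String] =
      g.subgraph(vpred = (id, attr) => id != badId)

    // Sketch only: drop every edge from src to dst, keeping all vertices
    def removeEdge(g: Graph[String, String], src: VertexId, dst: VertexId): Graph[String, String] =
      g.subgraph(epred = t => !(t.srcId == src && t.dstId == dst))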

How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-04-30 Thread PengWeiPRC
Hi there, I was wondering if somebody could give me some suggestions about how to handle this situation: I have a Spark program that first reads a 6GB file locally (not as an RDD) and then runs the map/reduce tasks. This 6GB file contains information that is shared by all the map tasks.
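
One option, given here as a hedged sketch rather than something from the thread: if the shared data fits in executor memory, load it once on the driver and ship it with a broadcast variable, so each machine keeps a single read-only copy instead of re-reading the file per task. The paths and the tab-separated key/value layout below are assumptions.

    import scala.io.Source
    import org.apache.spark.{SparkConf, SparkContext}

    object SharedLookup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shared-file-example"))
        // Build a lookup table from the local file; "key<TAB>value" per line is assumed.
        val table: Map[String, String] =
          Source.fromFile("/local/path/shared.tsv").getLines()
            .map { line => val Array(k, v) = line.split("\t", 2); k -> v }
            .toMap
        val tableBc = sc.broadcast(table)                  // one copy shipped per executor
        val enriched = sc.textFile("hdfs:///input/records.txt")
          .map(key => key + "\t" + tableBc.value.getOrElse(key, "unknown"))
        enriched.saveAsTextFile("hdfs:///output/enriched")
        sc.stop()
      }
    }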

Re: something about memory usage

2014-04-30 Thread wxhsdp
Hi Daniel, thanks for your help. I'm running slaves with just 1 core, but I still can't work it out. The executor does the tasks one by one: task0, task1, task2... How can I get the memory task1 used, with so many threads running in the background, and GC as well?
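
For what it's worth, a rough sketch (not from the thread, and only approximate, since GC and other executor threads share the same heap, which is exactly the difficulty raised above): log per-partition heap deltas inside the task itself as a coarse signal.

    import org.apache.spark.{SparkConf, SparkContext}

    object TaskMemoryProbe {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("task-memory-probe"))
        val probed = sc.parallelize(1 to 1000000, 4).mapPartitionsWithIndex { (idx, iter) =>
          val rt = Runtime.getRuntime
          val before = rt.totalMemory() - rt.freeMemory()
          val materialized = iter.toArray                  // force the partition into memory
          val after = rt.totalMemory() - rt.freeMemory()
          Iterator((idx, after - before, materialized.length))
        }
        probed.collect().foreach { case (idx, delta, n) =>
          println(s"partition $idx: ~$delta bytes for $n elements")   // coarse estimate only
        }
        sc.stop()
      }
    }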

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Patrick Wendell
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file, because it will always produce the same filename. (Heavy use of pseudo code:) dir = "/some/dir"; rdd.coalesce(1).saveAsTextFile(dir); f = new File(dir + "part-0"); f.move
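
A slightly more concrete sketch of that pseudo code, using the Hadoop FileSystem API (the single part file is usually named part-00000, but verify on your setup; the helper name is made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD

    def saveAsSingleFile(rdd: RDD[String], dir: String, finalName: String): Unit = {
      rdd.coalesce(1).saveAsTextFile(dir)
      val fs = FileSystem.get(new Configuration())   // for S3, get the FileSystem from the output URI instead
      fs.rename(new Path(dir + "/part-00000"), new Path(dir + "/" + finalName))
    }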

same partition id means same location?

2014-04-30 Thread wxhsdp
Hi, I'm just reviewing "advanced spark features"; it's about the PageRank example. It says "any shuffle operation on two RDDs will take on the partitioner of one of them, if one is set". So first we partition the Links by a HashPartitioner, then we join the Links and Ranks0. Ranks0 will tak
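
A minimal sketch of that setup (input path and format are assumptions): Links is hash-partitioned and cached once, so the join takes on its partitioner and co-locates Ranks0 with it instead of reshuffling Links on every iteration.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._               // pair-RDD operations

    val sc = new SparkContext(new SparkConf().setAppName("partitioner-example"))
    val links = sc.textFile("hdfs:///pagerank/links.txt")
      .map { line => val Array(src, dst) = line.split("\\s+", 2); (src, dst) }
      .partitionBy(new HashPartitioner(8))
      .cache()
    val ranks0 = links.mapValues(_ => 1.0)               // inherits links' partitioner
    val contribs = links.join(ranks0)                    // uses the HashPartitioner; links is not reshuffled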

CDH 5.0 and Spark 0.9.0

2014-04-30 Thread Paul Schooss
Hello, I was unable to run the following commands from the Spark shell with CDH 5.0 and Spark 0.9.0, see below. Once I removed the property io.compression.codec.lzo.class (value com.hadoop.compression.lzo.LzoCodec, final true) from core-site.xml on the node, the Spark commands worked. Is there a spec

Re: Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-30 Thread Tathagata Das
Whatever is inside the mapPartition gets executed on the workers. If that mapPartition function refers to a global variable in the driver, then that variable gets serialized and sent to the workers as well. So the hll (defined in line 63) is an empty HyperLogLogMonoid that gets serialized and sent to w
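
A simplified sketch of that point, without Algebird (everything below is illustrative): `zero` lives on the driver, but because the mapPartitions closure refers to it, it is serialized and shipped to each worker, where every partition works with its own deserialized copy.

    import org.apache.spark.{SparkConf, SparkContext}

    object ClosureCaptureExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("closure-capture"))
        val zero = scala.collection.mutable.Set.empty[Int]      // defined on the driver
        val perPartitionSizes = sc.parallelize(1 to 100, 4).mapPartitions { iter =>
          val seen = zero.clone()                               // worker-side copy of the shipped value
          iter.foreach(seen += _)
          Iterator(seen.size)
        }
        println(perPartitionSizes.collect().toSeq)              // prints something like (25, 25, 25, 25)
        sc.stop()
      }
    }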

Re: Strange lookup behavior. Possible bug?

2014-04-30 Thread Yadid Ayzenberg
Dear Sparkers, Has anyone got any insight on this? I am really stuck. Yadid On 4/28/14, 11:28 AM, Yadid Ayzenberg wrote: Thanks for your answer. I tried running on a single machine - master and worker on one host. I get exactly the same results. Very little CPU activity on the machine in qu

update of RDDs

2014-04-30 Thread narayanabhatla NarasimhaMurthy
In our application, we need distributed RDDs containing key-value maps. We have operations that update RDDs by adding entries to the map, deleting entries from the map, and updating the value part of maps. We also have map-reduce functions that operate on the RDDs. The questions are the followi
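
RDDs are immutable, so in practice "updating" means deriving a new RDD each time. A hedged sketch of one way to express adds, updates, and deletes on a key-value RDD (the names and the "upsert wins" merge rule are assumptions, not from the thread):

    import org.apache.spark.SparkContext._               // pair-RDD operations (cogroup, subtractByKey)
    import org.apache.spark.rdd.RDD

    def applyUpdates(current: RDD[(String, String)],
                     upserts: RDD[(String, String)],
                     deletions: RDD[String]): RDD[(String, String)] = {
      // An upserted value overrides the existing one; otherwise the old value survives.
      val merged = current.cogroup(upserts).flatMap { case (k, (oldVs, newVs)) =>
        val v = if (newVs.nonEmpty) Some(newVs.head) else oldVs.headOption
        v.map(k -> _)
      }
      merged.subtractByKey(deletions.map(k => (k, ())))  // drop deleted keys
    }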

Re: Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-30 Thread buremba
Thanks for your reply. Sorry for the late response; I wanted to do some tests before writing back. The counting part works similarly to your advice: I specify a minimum interval like 1 minute, and for each hour, day, etc. it sums all the counters of the current children intervals. However, when I want to "cou

My talk on "Spark: The Next Top (Compute) Model"

2014-04-30 Thread Dean Wampler
I meant to post this last week, but this is a talk I gave at the Philly ETE conf. last week: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model Also here: http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf dean -- Dean Wampler, Ph.D. Typesafe @deanwampl

[ANN]: Scala By The Bay Conference ( aka Silicon Valley Scala Symposium)

2014-04-30 Thread Chester Chen
Hi, this is not related to Spark, but I thought you might be interested: the second SF Scala conference is coming this August. The SF Scala conference was called the "Silicon Valley Scala Symposium" last year. From now on, it will be known as "Scala By The Bay". http://www.scalabythebay

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread Marcelo Vanzin
Hi, One thing you can do is set the spark version your project depends on to "1.0.0-SNAPSHOT" (make sure it matches the version of Spark you're building); then before building your project, run "sbt publishLocal" on the Spark tree. On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wrote: > i fixed it. >
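
A minimal build.sbt sketch of that suggestion (the artifact coordinates reflect the 1.0.0-SNAPSHOT era discussed here; adjust the Scala version to match your Spark build). Run `sbt publishLocal` in the Spark checkout first so the snapshot artifact exists in your local Ivy repository.

    // build.sbt (sketch)
    name := "my-spark-app"

    scalaVersion := "2.10.4"

    // Resolved from ~/.ivy2/local after running `sbt publishLocal` in the Spark tree
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"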

Re: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Tathagata Das
Yeah, I remember changing fold to sum in a few places, probably in testsuites, but missed this example I guess. On Wed, Apr 30, 2014 at 1:29 PM, Sean Owen wrote: > S is the previous count, if any. Seq[V] are potentially many new > counts. All of them have to be added together to keep an accura

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Peter
Thanks Nicholas, this is a bit of a shame; it's not very practical for log roll-up, for example, when every output needs to be in its own "directory".  On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas wrote: Yes, saveAsTextFile() will give you 1 part per RDD partition. When you coalesce(1),

Re: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Sean Owen
S is the previous count, if any. Seq[V] are potentially many new counts. All of them have to be added together to keep an accurate total. It's as if the count were 3, and I tell you I've just observed 2, 5, and 1 additional occurrences -- the new count is 3 + (2+5+1) not 1 + 1. I butted in since
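
A minimal sketch of that arithmetic in code (the socket source, batch interval, and checkpoint path are assumptions, not from the thread): Seq[V] holds all the new counts seen in a batch, so they are summed and added to the previous state.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._    // pair-DStream operations

    object RunningWordCount {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("running-count"), Seconds(10))
        ssc.checkpoint("/tmp/running-count-checkpoint")      // required for stateful operations
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))
        val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
          (newCounts: Seq[Int], previous: Option[Int]) =>
            Some(previous.getOrElse(0) + newCounts.sum)      // 3 + (2 + 5 + 1), not 3 + 1
        }
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }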

RE: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Adrian Mocanu
Hi TD, Why does the example keep recalculating the count via fold? Wouldn't it make more sense to get the last count in the values Seq, add 1 to it, and save that as the current count? From what Sean explained I understand that all values in Seq have the same key. Then when a new value for that key is

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Nicholas Chammas
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you coalesce(1), you move everything in the RDD to a single partition, which then gives you 1 output file. It will still be called part-0 or something like that because that’s defined by the Hadoop API that Spark uses for readi

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
I agree with you in general that as an API user, I shouldn’t be relying on code. However, without looking at the code, there is no way for me to find out even whether map() keeps the row order. Without the knowledge at all, I’d need to do “sort” every time I need certain things in a certain order.

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Peter
Ah, looks like RDD.coalesce(1) solves one part of the problem. On Wednesday, April 30, 2014 11:15 AM, Peter wrote: Hi Playing around with Spark & S3, I'm opening multiple objects (CSV files) with:     val hfile = sc.textFile("s3n://bucket/2014-04-28/") so hfile is an RDD representing 10 object

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mark Hamstra
Which is what you shouldn't be doing as an API user, since that implementation code might change. The documentation doesn't mention a row ordering guarantee, so none should be assumed. It is hard enough for us to correctly document all of the things that the API does do. We really shouldn't be f

Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Peter
Hi Playing around with Spark & S3, I'm opening multiple objects (CSV files) with:     val hfile = sc.textFile("s3n://bucket/2014-04-28/") so hfile is an RDD representing 10 objects that were "underneath" 2014-04-28. After I've sorted and otherwise transformed the content, I'm trying to write it
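
Pulling the thread together, a hedged sketch of the whole pattern (the sort key, output path, and CSV layout are assumptions): read every object under the prefix, transform, then coalesce to a single partition so saveAsTextFile emits one part file, which can then be renamed as discussed above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._               // sortByKey and friends

    val sc = new SparkContext(new SparkConf().setAppName("s3-rollup"))
    val lines = sc.textFile("s3n://bucket/2014-04-28/")  // all objects under the prefix
    val sorted = lines.keyBy(_.split(",")(0)).sortByKey().values
    sorted.coalesce(1).saveAsTextFile("s3n://bucket/rollup/2014-04-28")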

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
Okay, that makes sense. It’d be great if this can be better documented at some point, because the only way to find out about the resulting RDD row order is by looking at the code. Thanks for the discussion! Mingyu On 4/29/14, 11:59 PM, "Patrick Wendell" wrote: >I don't think we guarantee an

Re: processing s3n:// files in parallel

2014-04-30 Thread foundart
Thanks, Andrew. As it turns out, the tasks were getting processed in parallel in separate threads on the same node. Using the parallel collection of hadoop files was sufficient to trigger that but my expectation that the tasks would be spread across nodes rather than cores on a single node led me

Re: the spark configuage

2014-04-30 Thread Diana Carroll
I'm guessing your shell stopping when it attempts to connect to the RM is not related to that warning. You'll get that message out of the box from Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0 installed in default locations, and got rid of the warning by setting HADOOP_HOME

new Washington DC Area Spark Meetup

2014-04-30 Thread Donna-M. Fernandez
Hi, all! For those in the Washington DC area (DC/MD/VA), we just started a new Spark Meetup. We'd love for you to join! -d Here's the link: http://www.meetup.com/Washington-DC-Area-Spark-Interactive/ Description: This is an interactive meetup for Washington DC, Virginia and Maryland users, e

Can a job running on a cluster read from a local file path ?

2014-04-30 Thread Shubhabrata
1) Can a job (python script), running on a standalone cluster read from local file path ? 2) Does sc.addPyFile(path) create a directory or only copies the file ? 3) If the path contains a zip file, does it automatically gets unzipped ? -- View this message in context: http://apache-spark-user

Re: Shuffle phase is very slow, any help, thx!

2014-04-30 Thread Daniel Darabos
So the problem is that 99 tasks are fast (< 1 second), but 1 task is really slow (5+ hours), is that right? And your operation is graph.vertices.count? That is odd, but it could be that this job includes running previous transformations. How did you construct the graph? On Tue, Apr 29, 2014 at 3:4

Re: something about memory usage

2014-04-30 Thread Daniel Darabos
On Wed, Apr 30, 2014 at 1:52 PM, wxhsdp wrote: > Hi guys, I want to do some optimizations of my Spark code. I use VisualVM to monitor the executor when running the app. Here's the snapshot: > <http://apache-spark-user-list.1001560.n3.nabble.com/file/n5107/executor.png> > from the

something about memory usage

2014-04-30 Thread wxhsdp
Hi guys, I want to do some optimizations of my Spark code. I use VisualVM to monitor the executor when running the app. Here's the snapshot: [the executor.png screenshot linked in the reply above]. From the snapshot, I can get the memory usage information about the executor

Re: the spark configuage

2014-04-30 Thread Andras Nemeth
On 30 Apr 2014 10:35, "Akhil Das" wrote: > > Hi > > The reason you saw that warning is the native Hadoop library $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was actually compiled on 32 bit. > > Anyway, it's just a warning, and won't impact Hadoop's functionalities. > > Here is the way if you do wan

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread Andras Nemeth
On 30 Apr 2014 06:59, "Patrick Wendell" wrote: > > The signature of this function was changed in spark 1.0... is there > any chance that somehow you are actually running against a newer > version of Spark? > > On Tue, Apr 29, 2014 at 8:58 PM, wxhsdp wrote: > > i met with the same question when up

Re: the spark configuage

2014-04-30 Thread Rahul Singhal
Hi, Just in case you already have the 64-bit version, the following works for me on Spark 0.9.1: SPARK_LIBRARY_PATH=/opt/hadoop/lib/native/ ./bin/spark-shell (where my libhadoop.so is present in /opt/hadoop/lib/native/) Thanks, Rahul Singhal From: Akhil Das <ak...@sigmoidanalytics.com>

Re: the spark configuage

2014-04-30 Thread Akhil Das
Hi The reason you saw that warning is that the native Hadoop library $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was actually compiled on 32 bit. Anyway, it's just a warning, and won't impact Hadoop's functionality. Here is how to eliminate this warning, if you do want to: download the source code o

Re: Joining not-pair RDDs in Spark

2014-04-30 Thread jsantos
That's the approach I finally used. Thanks for your help :-) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-not-pair-RDDs-in-Spark-tp5034p5099.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

the spark configuage

2014-04-30 Thread Sophia
Hi, when I configure Spark and run the shell instruction ./spark-shell, it tells me: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. When it connects to the ResourceManager, it stops. What should I do? Looking forward to your reply -- View this mess

Re: Shuffle Spill Issue

2014-04-30 Thread Daniel Darabos
Whoops, you are right. Sorry for the misinformation. Indeed reduceByKey just calls combineByKey: def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = { combineByKey[V]((v: V) => v, func, func, partitioner) } (I think I confused reduceByKey with groupByKey.) On Wed, Apr

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread wxhsdp
I fixed it. I made my sbt project depend on spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar and it works -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-from-Spark-Java-tp4937p5096.html Sent from the A

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Patrick Wendell
I don't think we guarantee anywhere that union(A, B) will behave by concatenating the partitions, it just happens to be an artifact of the current implementation. rdd1 = [1,2,3] rdd2 = [1,4,5] rdd1.union(rdd2) = [1,2,3,1,4,5] // how it is now rdd1.union(rdd2) = [1,4,5,1,2,3] // some day it could
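
A small sketch of the takeaway (the numbers mirror Patrick's example): treat the element order of union() as unspecified and sort explicitly whenever order matters.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._               // sortByKey

    val sc = new SparkContext(new SparkConf().setAppName("union-order"))
    val rdd1 = sc.parallelize(Seq(1, 2, 3))
    val rdd2 = sc.parallelize(Seq(1, 4, 5))
    val unioned = rdd1.union(rdd2)                       // relative order is an implementation detail
    val ordered = unioned.keyBy(identity).sortByKey().values
    println(ordered.collect().toSeq)                     // 1, 1, 2, 3, 4, 5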