spark executor/driver log files management

2014-03-24 Thread Sourav Chandra
Hi,

I have few questions regarding log file management in spark:

1. Currently I did not find any way to modify the log file names for
executors/drivers; they are hardcoded as stdout and stderr. Also, there is no
log rotation.

In case of streaming application this will grow forever and become
unmanageable. Is there any way to overcome this?

Thanks,
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]


*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than
0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
exceeded*


But I had set the following in spark-env.sh and hadoop-env.sh:

export SPARK_DEAMON_MEMORY=8g
export SPARK_WORKER_MEMORY=8g
export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"


export HADOOP_HEAPSIZE=4000

Any suggestions ??

-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans algorithm in mllib clustering for
your problem.

-Xiangrui

On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming  wrote:
> Hi,
>
> This is on a 4 nodes cluster each with 32 cores/256GB Ram.
>
> (0.9.0) is deployed in a stand alone mode.
>
> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>
> This is on the first iteration. K=50. Here's the code I use:
> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>
> Thanks!
>
>
>
> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng  wrote:
>
>> Hi Tsai,
>>
>> Could you share more information about the machine you used and the
>> training parameters (runs, k, and iterations)? It can help solve your
>> issues. Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  wrote:
>>> Hi,
>>>
>>> At the reduceBuyKey stage, it takes a few minutes before the tasks start 
>>> working.
>>>
>>> I have -Dspark.default.parallelism=127 cores (n-1).
>>>
>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>
>>> And there is nothing particular on the master log file. From the 
>>> spark-shell:
>>>
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
>>> executor 2: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
>>> bytes in 193 ms
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
>>> executor 1: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
>>> bytes in 96 ms
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
>>> executor 0: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
>>> bytes in 100 ms
>>>
>>> But it stops there for some significant time before any movement.
>>>
>>> In the stage detail of the UI, I can see that there are 127 tasks running 
>>> but the duration each is at least a few minutes.
>>>
>>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB 
>>> (50M rows).
>>>
>>> Is this a normal behaviour?
>>>
>>> Thanks!
>


Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks, let me try with a smaller K.

Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?





On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng  wrote:

> K = 50 is certainly a large number for k-means. If there is no
> particular reason to have 50 clusters, could you try to reduce it
> to, e.g, 100 or 1000? Also, the example code is not for large-scale
> problems. You should use the KMeans algorithm in mllib clustering for
> your problem.
> 
> -Xiangrui
> 
> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming  wrote:
>> Hi,
>> 
>> This is on a 4 nodes cluster each with 32 cores/256GB Ram.
>> 
>> (0.9.0) is deployed in a stand alone mode.
>> 
>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>> 
>> This is on the first iteration. K=50. Here's the code I use:
>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>> 
>> Thanks!
>> 
>> 
>> 
>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng  wrote:
>> 
>>> Hi Tsai,
>>> 
>>> Could you share more information about the machine you used and the
>>> training parameters (runs, k, and iterations)? It can help solve your
>>> issues. Thanks!
>>> 
>>> Best,
>>> Xiangrui
>>> 
>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  wrote:
 Hi,
 
 At the reduceBuyKey stage, it takes a few minutes before the tasks start 
 working.
 
 I have -Dspark.default.parallelism=127 cores (n-1).
 
 CPU/Network/IO is idling across all nodes when this is happening.
 
 And there is nothing particular on the master log file. From the 
 spark-shell:
 
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
 executor 2: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
 bytes in 193 ms
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
 executor 1: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
 bytes in 96 ms
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
 executor 0: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
 bytes in 100 ms
 
 But it stops there for some significant time before any movement.
 
 In the stage detail of the UI, I can see that there are 127 tasks running 
 but the duration each is at least a few minutes.
 
 I'm working off local storage (not hdfs) and the kmeans data is about 
 6.5GB (50M rows).
 
 Is this a normal behaviour?
 
 Thanks!
>> 



Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
To be clear on what your configuration will do:

- SPARK_DAEMON_MEMORY=8g will make your standalone master and worker
schedulers have a lot of memory. These do not impact the actual amount of
useful memory given to executors or your driver, however, so you probably
don't need to set this.
- SPARK_WORKER_MEMORY=8g allows each worker to provide up to 8g worth of
executors. In itself, this does not actually give executors more memory,
just allows them to get more. This is a necessary setting.

- *_JAVA_OPTS should not be used to set memory parameters, as they may or
may not override their *_MEMORY counterparts.

The two things you are not configuring are the amount of memory for your
driver (for a 0.8.1 spark-shell, you must use SPARK_MEM) and the amount of
memory given to each executor (spark.executor.memory). By default, Spark
executors are only 512MB in size, so you will probably want to increase
this up to the value of SPARK_WORKER_MEMORY. This will provide you with 1
executor per worker that uses all available memory, which is probably what
you want for testing purposes (it is less ideal for sharing a cluster).

In case the distinction between workers/masters (collectively "daemons"),
executors, and drivers is not clear to you, please check out the
corresponding documentation on Spark clusters:
https://spark.incubator.apache.org/docs/0.8.1/cluster-overview.html
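
As a concrete illustration, here is a minimal sketch (not from this thread) of
how a standalone application might set executor memory on Spark 0.8.x, where
configuration was passed as Java system properties before the SparkContext is
created; the master URL and application name are placeholders:

import org.apache.spark.SparkContext

object MemoryConfigSketch {
  def main(args: Array[String]): Unit = {
    // Must be set before the SparkContext is constructed.
    System.setProperty("spark.executor.memory", "8g")

    // "spark://master:7077" stands in for your standalone master URL.
    val sc = new SparkContext("spark://master:7077", "MemoryConfigSketch")

    // ... job code ...

    sc.stop()
  }
}

(For the 0.8.1 spark-shell itself, the SPARK_MEM environment variable mentioned
above plays this role, since you do not construct the SparkContext yourself.)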


On Mon, Mar 24, 2014 at 12:24 AM, Sai Prasanna wrote:

> Hi All !! I am getting the following error in interactive spark-shell
> [0.8.1]
>
>
>  *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
> than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
> exceeded*
>
>
> But i had set the following in the spark.env.sh and hadoop-env.sh
>
> export SPARK_DEAMON_MEMORY=8g
> export SPARK_WORKER_MEMORY=8g
> export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
> export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"
>
>
> export HADOOP_HEAPSIZE=4000
>
> Any suggestions ??
>
> --
> *Sai Prasanna. AN*
> *II M.Tech (CS), SSSIHL*
>
>
>


Re: distinct on huge dataset

2014-03-24 Thread Kane
Got a bit further; I think the out-of-memory error was caused by setting
spark.spill to false. Now I have this error. Is there an easy way to
increase the open-file limit for Spark, cluster-wide?

java.io.FileNotFoundException:
/tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2
(Too many open files)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:192)
at
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
at
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
at
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
at
org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
at
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sean Owen
PS you have a typo in "DEAMON" - it's DAEMON. Thanks Latin.
On Mar 24, 2014 7:25 AM, "Sai Prasanna"  wrote:

> Hi All !! I am getting the following error in interactive spark-shell
> [0.8.1]
>
>
>  *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
> than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
> exceeded*
>
>
> But i had set the following in the spark.env.sh and hadoop-env.sh
>
> export SPARK_DEAMON_MEMORY=8g
> export SPARK_WORKER_MEMORY=8g
> export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
> export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"
>
>
> export HADOOP_HEAPSIZE=4000
>
> Any suggestions ??
>
> --
> *Sai Prasanna. AN*
> *II M.Tech (CS), SSSIHL*
>
>
>


Re: Java API - Serialization Issue

2014-03-24 Thread santhoma
I am also facing the same problem. I have implemented Serializable for my
code, but the exception is thrown from third-party libraries over which I have
no control.

Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task not serializable: java.io.NotSerializableException: (lib class name
here)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

Is it mandatory that Serializable must be implemented for dependent jars as
well?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Java-API-Serialization-Issue-tp1460p3086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: java.io.NotSerializableException Of dependent Java lib.

2014-03-24 Thread santhoma
Can someone answer this question, please?
Specifically, about the Serializable implementation of dependent jars?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-Of-dependent-Java-lib-tp1973p3087.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Thanks Aaron and Sean...

Setting SPARK_MEM finally worked. But I have a small doubt.
1) What is the default value that is allocated for the JVM and for the heap
space used by the garbage collector?

2) Usually we set 1/3 of total memory for the heap. So what should be the
practice for Spark processes? Where and how should we set them,
and what default value does it assume?

3) Moreover, if we set SPARK_MEM to say 8g and I have 16g of RAM, can at most
two executors run on a node of the cluster?


Thanks Again !!




On Mon, Mar 24, 2014 at 2:13 PM, Sean Owen  wrote:

> PS you have a typo in "DEAMON" - its DAEMON. Thanks Latin.
> On Mar 24, 2014 7:25 AM, "Sai Prasanna"  wrote:
>
>> Hi All !! I am getting the following error in interactive spark-shell
>> [0.8.1]
>>
>>
>>  *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
>> than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
>> exceeded*
>>
>>
>> But i had set the following in the spark.env.sh and hadoop-env.sh
>>
>> export SPARK_DEAMON_MEMORY=8g
>> export SPARK_WORKER_MEMORY=8g
>> export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
>> export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"
>>
>>
>> export HADOOP_HEAPSIZE=4000
>>
>> Any suggestions ??
>>
>> --
>> *Sai Prasanna. AN*
>> *II M.Tech (CS), SSSIHL*
>>
>>
>>


-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside.All
the pressures of life can never hurt you, Unless you let them in.*


Re: Java API - Serialization Issue

2014-03-24 Thread Sourav Chandra
I can suggest two things:

1. While creating workers and submitting tasks, make sure you are not keeping
any unwanted external class references (ones that are not used in the closure
and are not serializable).
2. If this is ensured and you still get issues from a 3rd-party library, you
can mark that 3rd-party reference as @transient in your code and define a
custom readObject(ObjectInputStream) method to re-initialize that particular
variable.

e.g. (ThirdPartyClass below is a placeholder for the library type):

class MyClass extends Serializable {
  // 3rd-party reference that is not serializable
  @transient private var ref: ThirdPartyClass = initRef()

  private def initRef(): ThirdPartyClass = {
    new ThirdPartyClass() // construct the 3rd-party object here
  }

  // custom deserialization hook invoked by Java serialization on the worker
  private def readObject(is: java.io.ObjectInputStream): Unit = {
    is.defaultReadObject() // default deserialization for all other fields
    ref = initRef()        // re-create the transient 3rd-party reference
  }
}

Thanks,
Sourav


On Mon, Mar 24, 2014 at 3:06 PM, santhoma  wrote:

> I am also facing the same problem. I have implemented Serializable for my
> code, but the exception is thrown from third party libraries on which I
> have
> no control .
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
> Task not serializable: java.io.NotSerializableException: (lib class name
> here)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>
> Is it mandatory that Serializable must be implemented for dependent jars as
> well?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Java-API-Serialization-Issue-tp1460p3086.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: Spark worker threads waiting

2014-03-24 Thread sparrow
Yes, my input data is partitioned in a completely random manner, so each
worker that produces shuffle data processes only a part of it. The way I
understand it is that before each stage, each worker needs to distribute the
correct partitions (based on hash key ranges?) to other workers, and this
is where I would expect network input/output to spike. After that, the
processing should occur, where I would expect CPU to spike. If you check
the image I attached earlier in this thread, you can see that, for example,
between 8.25 and 8.30 disk, network, and CPU are almost not utilized.

I would be very happy to receive some suggestions on how to debug this, or
where you think would be a good place to start looking.

Kind regards, Domen


On Fri, Mar 21, 2014 at 6:58 PM, Mayur Rustagi [via Apache Spark User List]
 wrote:

> In your task details I dont see a large skew in tasks so the low cpu usage
> period occurs between stages or during stage execution.
> One issue possible is your data is 89GB Shuffle read, if the machine
> producing the shuffle data is not the one processing it, data shuffling
> across machines may be causing the delay.
> Can you look at your network traffic during that period to see performance.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Fri, Mar 21, 2014 at 8:33 AM, sparrow <[hidden email]> wrote:
>
>> Here is the stage overview:
>> [image: Inline image 2]
>>
>> and here are the stage details for stage 0:
>> [image: Inline image 1]
>> Transformations from first stage to the second one are trivial, so that
>> should not be the bottle neck (apart from keyBy().groupByKey() that causes
>> the shuffle write/read).
>>
>> Kind regards, Domen
>>
>>
>>
>> On Thu, Mar 20, 2014 at 8:38 PM, Mayur Rustagi [via Apache Spark User
>> List] <[hidden email]> wrote:
>>
>>> I would have preferred the stage window details & aggregate task
>>> details(above the task list).
>>> Basically if you run a job , it translates to multiple stages, each
>>> stage translates to multiple tasks (each run on worker core).
>>> So some breakup like
>>> my job is taking 16 min
>>> 3 stages , stage 1 : 5 min Stage 2: 10 min & stage 3:1 min
>>> in Stage 2 give me task aggregate screenshot which talks about 50
>>> percentile, 75 percentile & 100 percentile.
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>>
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi 
>>>
>>>
>>>
 On Thu, Mar 20, 2014 at 9:55 AM, sparrow <[hidden email]> wrote:
>>>

 This is what the web UI looks like:
 [image: Inline image 1]

 I also tail all the worker logs and theese are the last entries before
 the waiting begins:

 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, minRequest: 10066329
 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 29853 non-zero-bytes blocks out of 37714 blocks
 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 5 remote gets in  62 ms
 [PSYoungGen: 12464967K->3767331K(10552192K)]
 36074093K->29053085K(44805696K), 0.6765460 secs] [Times: user=5.35
 sys=0.02, real=0.67 secs]
 [PSYoungGen: 10779466K->3203826K(9806400K)]
 35384386K->31562169K(44059904K), 0.6925730 secs] [Times: user=5.47
 sys=0.00, real=0.70 secs]

 From the screenshot above you can see that task take ~ 6 minutes to
 complete. The amount of time it takes the tasks to complete seems to depend
 on the amount of input data. If s3 input string captures 2.5 times less
 data (less data to shuffle write  and later read), same tasks take 1
 minute. Any idea how to debug what the workers are doing?

 Domen

 On Wed, Mar 19, 2014 at 5:27 PM, Mayur Rustagi [via Apache Spark User
 List] <[hidden email]> wrote:

> You could have some outlier task that is preventing the next set of
> stages from launching. Can you check out stages state in the Spark WebUI,
> is any task running or is everything halted.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257

RDD usage

2014-03-24 Thread Chieh-Yen
Dear all,

I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:

case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends
Serializable {
  var y : Double = input_y
  var x : Array[Double] = input_x
  ..
}
Furthermore, I created the RDD by the following function.

def parsePoint(line: String): AppDataPoint = {
  /* Some related works for parsing */
  ..
}

Assume the RDD called "points":

val lines = sc.textFile(inputPath, numPartition)
var points = lines.map(parsePoint _).cache()

The question is that I tried to modify the values of this RDD; the
operation is:

points.foreach(p => p.y = another_value)

The operation works.
There isn't any warning or error message shown by the system, and
the results are right.
I wonder whether modifying an RDD in place like this is a correct and in fact
workable design.
The docs say that RDDs are immutable; is there any suggestion?
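
One immutable-friendly alternative (a sketch, not from this thread; it assumes
another_value is a Double already in scope) is to derive a new RDD with map
rather than mutating the cached objects, since in-place mutations are not part
of the lineage and can be lost if a partition is evicted and recomputed:

val updatedPoints = points.map(p => AppDataPoint(another_value, p.x))
updatedPoints.cache()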

Thanks a lot.

Chieh-Yen Lin


Re: Spark Java example using external Jars

2014-03-24 Thread dmpour23
Hello,

Has anyone got any ideas? I am not quite sure if my problem is an exact fit
for Spark, since in reality, in this section of my program, I am not really
doing a reduce job, simply a group-by and partition.

Would calling pipe on the partitioned JavaRDD do the trick? Are there any
examples using pipe?

Thanks
Dimitri







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Java-example-using-external-Jars-tp2647p3092.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, 
the inode usage was about 50%. On two of the slaves, one had inode usage 
of 96% and on the other it was 100%. When I went into /tmp on these two 
nodes - there were a bunch of /tmp/spark* subdirectories which I 
deleted. This resulted in the inode consumption falling back down to 50% 
and the job running successfully to completion. The slave with the 100% 
inode usage had the spark/work/app//stdout with the message that 
the filesystem is running out of disk space (which I posted in the 
original email that started this thread).


What is interesting is that only two out of the 16 slaves had this 
problem :)


Ognen

On 3/24/14, 12:57 AM, Patrick Wendell wrote:

Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely counter
intuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
 wrote:

I would love to work on this (and other) stuff if I can bother someone with
questions offline or on a dev mailing list.
Ognen


On 3/23/14, 10:04 PM, Aaron Davidson wrote:

Thanks for bringing this up, 100% inode utilization is an issue I haven't
seen raised before and this raises another issue which is not on our current
roadmap for state cleanup (cleaning up data which was not fully cleaned up
from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
 wrote:

Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is going on.
A job that used to run on the same cluster is failing with this mysterious
message about not having enough disk space when in fact I can see through
"watch df -h" that the free space is always hovering around 3+GB on the disk
and the free inodes are at 50% (this is on master). I went through each
slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
and no mention of too many open files failures on any of the slaves nor on
the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created. With
spark.shuffle.consolidateFiles turned on, we would instead create only P
files. Disk space consumption is largely unaffected, however. by the number
of partitions unless each partition is particularly small.

You might look at the actual executors' logs, as it's possible that this
error was caused by an earlier exception, such as "too many open files".


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
 wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than /tmp if
/tmp is full. Actually it's recommended to have multiple local disks and set
to to a comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a transformation influence
something in terms of disk space consumption? Or inode consumption?

Thanks,
Ognen



--
"A distributed system is one in which the failure of a computer you didn't
even know existed can render your own computer unusable"
-- Leslie Lamport



--
"No matter what they ever do to us, we must always act for the love of our
people and the earth. We must not react out of hatred against those who have
no sense."
-- John Trudell


--
“A distributed system is one in which the failure of a computer you didn’t even 
know existed can render your own computer unusable”
-- Leslie Lamport



Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Oh, glad to know it's that simple!

Patrick, in your last comment did you mean filter in? As in I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.

Nick

On Monday, March 24, 2014, Patrick Wendell wrote:

> As Mark said you can actually access this easily. The main issue I've
> seen from a performance perspective is people having a bunch of really
> small partitions. This will still work but the performance will
> improve if you coalesce the partitions using rdd.coalesce().
>
> This can happen for example if you do a highly selective filter on an
> RDD. For instance, you filter out one day of data from a dataset of a
> year.
>
> - Patrick
>
> On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra 
> >
> wrote:
> > It's much simpler: rdd.partitions.size
> >
> >
> > On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
> > > wrote:
> >>
> >> Hey there fellow Dukes of Data,
> >>
> >> How can I tell how many partitions my RDD is split into?
> >>
> >> I'm interested in knowing because, from what I gather, having a good
> >> number of partitions is good for performance. If I'm looking to
> understand
> >> how my pipeline is performing, say for a parallelized write out to HDFS,
> >> knowing how many partitions an RDD has would be a good thing to check.
> >>
> >> Is that correct?
> >>
> >> I could not find an obvious method or property to see how my RDD is
> >> partitioned. Instead, I devised the following thingy:
> >>
> >> def f(idx, itr): yield idx
> >>
> >> rdd = sc.parallelize([1, 2, 3, 4], 4)
> >> rdd.mapPartitionsWithIndex(f).count()
> >>
> >> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> >> answer I'm looking for. Derp. :)
> >>
> >> So in summary, should I care about how finely my RDDs are partitioned?
> And
> >> how would I check on that?
> >>
> >> Nick
> >>
> >>
> >> 
> >> View this message in context: How many partitions is my RDD split into?
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>


Re: java.io.NotSerializableException Of dependent Java lib.

2014-03-24 Thread yaoxin
We now have a method to work around this.

For classes that can't easily implement Serializable, we wrap the class in a
Scala object.
For example:

class A {}   // This class is not serializable
object AHolder {
  private val a: A = new A()
  def get: A = a
}

This works because Spark will serialize the instances that are used in the
closure.
But the Scala object won't be serialized and shipped; it's loaded locally in
each worker JVM.
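
A minimal self-contained sketch of this holder pattern (the class and method
names here are illustrative, not from the thread):

import org.apache.spark.SparkContext

// Stand-in for a 3rd-party class that is not serializable.
class Unserializable {
  def process(s: String): Int = s.length
}

// The holder: the instance is created lazily, once per JVM, and is never
// shipped with the closure.
object UnserializableHolder {
  private lazy val instance = new Unserializable()
  def get: Unserializable = instance
}

object HolderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "HolderSketch")
    // The closure only refers to the holder object, which is resolved locally
    // inside each worker JVM, so the non-serializable instance never travels.
    val lengths = sc.parallelize(Seq("spark", "holder", "pattern"))
      .map(s => UnserializableHolder.get.process(s))
    println(lengths.collect().mkString(", "))
    sc.stop()
  }
}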



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-Of-dependent-Java-lib-tp1973p3095.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


mapPartitions use case

2014-03-24 Thread Jaonary Rabarisoa
Dear all,

Sorry for asking such a basic question, but can someone explain when one
should use mapPartitions instead of map?

Thanks

Jaonary


Akka error with largish job (works fine for smaller versions)

2014-03-24 Thread Nathan Kronenfeld
What does this error mean:

@hadoop-s2.oculus.local:45186]: Error [Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: hadoop-s2.oculus.local/192.168.0.47:45186
]

?

-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Another thing I have noticed is that out of my master+15 slaves, two 
slaves always carry a higher inode load. So for example right now I am 
running an intensive job that takes about an hour to finish and two 
slaves have been showing an increase in inode consumption (they are 
about 10% above the rest of the slaves+master) and increasing.


Ognen

On 3/24/14, 7:00 AM, Ognen Duzlevski wrote:
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, 
the inode usage was about 50%. On two of the slaves, one had inode 
usage of 96% and on the other it was 100%. When i went into /tmp on 
these two nodes - there were a bunch of /tmp/spark* subdirectories 
which I deleted. This resulted in the inode consumption falling back 
down to 50% and the job running successfully to completion. The slave 
with the 100% inode usage had the spark/work/app//stdout with 
the message that the filesystem is running out of disk space (which I 
posted in the original email that started this thread).


What is interesting is that only two out of the 16 slaves had this 
problem :)


Ognen

On 3/24/14, 12:57 AM, Patrick Wendell wrote:

Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely counter
intuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
 wrote:
I would love to work on this (and other) stuff if I can bother 
someone with

questions offline or on a dev mailing list.
Ognen


On 3/23/14, 10:04 PM, Aaron Davidson wrote:

Thanks for bringing this up, 100% inode utilization is an issue I 
haven't
seen raised before and this raises another issue which is not on our 
current
roadmap for state cleanup (cleaning up data which was not fully 
cleaned up

from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
 wrote:
Bleh, strike that, one of my slaves was at 100% inode utilization 
on the

file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is 
going on.
A job that used to run on the same cluster is failing with this 
mysterious
message about not having enough disk space when in fact I can see 
through
"watch df -h" that the free space is always hovering around 3+GB on 
the disk
and the free inodes are at 50% (this is on master). I went through 
each
slave and the spark/work/app*/stderr and stdout and spark/logs/*out 
files
and no mention of too many open files failures on any of the slaves 
nor on

the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created. With
spark.shuffle.consolidateFiles turned on, we would instead create 
only P
files. Disk space consumption is largely unaffected, however. by 
the number

of partitions unless each partition is particularly small.

You might look at the actual executors' logs, as it's possible that 
this
error was caused by an earlier exception, such as "too many open 
files".



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
 wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than 
/tmp if
/tmp is full. Actually it's recommended to have multiple local 
disks and set

to to a comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a transformation 
influence

something in terms of disk space consumption? Or inode consumption?

Thanks,
Ognen



--
"A distributed system is one in which the failure of a computer you 
didn't

even know existed can render your own computer unusable"
-- Leslie Lamport



--
"No matter what they ever do to us, we must always act for the love 
of our
people and the earth. We must not react out of hatred against those 
who have

no sense."
-- John Trudell




--
“A distributed system is one in which the failure of a computer you didn’t even 
know existed can render your own computer unusable”
-- Leslie Lamport



Re: mapPartitions use case

2014-03-24 Thread Nathan Kronenfeld
I've seen two cases most commonly:

The first is when I need to create some processing object to process each
record.  If that object creation is expensive, creating one per record
becomes prohibitive.  So instead, we use mapPartitions, and create one per
partition, and use it on each record in the partition.

The other is that I've often found it much more efficient, when summarizing
data, to use a mutable form of the summary object, running over each record
in a partition and then reducing those per-partition results, than to create a
summary object per record and reduce that much larger set of summary
objects.  Again, it saves a lot of object creation.
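
For the first pattern, a minimal sketch (ExpensiveParser is just a stand-in
for whatever object is costly to construct per record):

import org.apache.spark.SparkContext

// Pretend this is expensive to build (a parser, a DB connection, a compiled
// regex set). It never needs to be serializable because it is created inside
// the task, on the worker.
class ExpensiveParser {
  def parse(line: String): Int = line.length
}

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "MapPartitionsSketch")
    val lines = sc.parallelize(Seq("one", "two", "three"), 2)

    val parsed = lines.mapPartitions { iter =>
      val parser = new ExpensiveParser()    // built once per partition...
      iter.map(line => parser.parse(line))  // ...and reused for every record
    }

    println(parsed.collect().mkString(", "))
    sc.stop()
  }
}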



On Mon, Mar 24, 2014 at 8:57 AM, Jaonary Rabarisoa wrote:

> Dear all,
>
> Sorry for asking such a basic question, but someone can explain when one
> should use mapPartiontions instead of map.
>
> Thanks
>
> Jaonary
>



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: Sliding Window operations do not work as documented

2014-03-24 Thread Sanjay Awatramani
Hi All,

I found out why this problem exists. Consider the following scenario:
- a DStream is created from any source. (I've checked with file and socket)
- No actions are applied to this DStream
- Sliding Window operation is applied to this DStream and an action is applied 
to the sliding window.
In this case, Spark will not even read the input stream in the batches that do
not fall on a slide-interval boundary. Put another way, it 
won't read the input when it doesn't have to apply the window function. This is 
happening because all transformations in Spark are lazy.

How to fix this or workaround it (see line#3):
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 
* 60 * 1000));
JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
inputStream.print(); // This is the workaround
JavaDStream<String> objWindow = inputStream.window(new
Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
objWindow.dstream().saveAsTextFiles("/Output", "");


The "Window operations" example on the streaming guide implies that Spark will 
read the stream in every batch, which is not happening because of the lazy 
transformations.
Wherever sliding window would be used, in most of the cases, no actions will be 
taken on the pre-window batch, hence my gut feeling was that Streaming would 
read every batch if any actions are being taken in the windowed stream.

Regards,
Sanjay



On Friday, 21 March 2014 8:06 PM, Sanjay Awatramani  
wrote:
 
Hi,

I want to run a map/reduce process over last 5 seconds of data, every 4 
seconds. This is quite similar to the sliding window pictorial example under 
Window Operations section on 
http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html 
. 

The RDDs returned by window transformation function are incorrect in my case. 
To investigate this further, I ran a series of examples with varying values of 
window length & slide interval. Summary of the test results:
(window length, slide interval) -> result
(3,1) -> success
(4,2) -> success
(3,2) -> fail
(4,3) -> fail
(5,4) -> fail
(5,2) -> fail

The only condition mentioned in the doc is that the two values (5 & 4) should
be multiples of the batch interval (1 in my case), and obviously, I get a
run-time error if I attempt to violate this condition. Looking at my results,
it seems that failures result when the window length isn't a multiple of the
slide interval.

My code:
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 
* 60 * 1000));
JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
JavaDStream<String> objWindow = inputStream.window(new
Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
objWindow.dstream().saveAsTextFiles("/Output", "");

Detailed results:
(3,1) -> success
@t_0: [inputStream's RDD@t_0]
@t_1: [inputStream's RDD@t_0,1]
@t_2: [inputStream's RDD@t_0,1,2]
@t_3: [inputStream's RDD@t_1,2,3]
@t_4: [inputStream's RDD@t_2,3,4]
@t_5: [inputStream's RDD@t_3,4,5]

(4,2) -> success
@t_0: nothing
@t_1: [inputStream's RDD@t_0,1]
@t_2: nothing
@t_3: [inputStream's RDD@t_0,1,2,3]
@t_4: nothing
@t_5: [inputStream's RDD@t_2,3,4,5]

(3,2) -> fail
@t_0: nothing
@t_1: [inputStream's RDD@t_0,1]
@t_2: nothing
@t_3: [inputStream's RDD@t_2,3]    //(expected RDD@t_1,2,3)
@t_4: nothing
@t_5: [inputStream's RDD@t_4,5]    //(expected RDD@t_3,4,5)

(4,3) -> fail
@t_0: nothing
@t_1: nothing
@t_2: [inputStream's RDD@t_0,1,2]
@t_3: nothing
@t_4: nothing
@t_5: [inputStream's RDD@t_3,4,5]    //(expected RDD@t_2,3,4,5)

(5,4) -> fail
@t_0: nothing
@t_1: nothing
@t_2: nothing
@t_3: [inputStream's RDD@t_0,1,2,3]
@t_4: nothing
@t_5: nothing
@t_6: nothing
@t_7: [inputStream's RDD@t_4,5,6,7]    //(expected RDD@t_3,4,5,6,7)

(5,2) -> fail
@t_0: nothing
@t_1: [inputStream's RDD@t_0,1]
@t_2: nothing
@t_3: [inputStream's RDD@t_0,1,2,3]
@t_4: nothing
@t_5: [inputStream's RDD@t_2,3,4,5]    //(expected RDD@t_1,2,3,4,5)
@t_6: nothing
@t_7: [inputStream's RDD@t_4,5,6,7]    //(expected RDD@t_3,4,5,6,7)

I have run all the above examples twice to be sure !
I believe either my understanding of sliding window mechanism is incorrect or 
there is a problem in the sliding window mechanism.

Regards,
Sanjay

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Mark,

This appears to be a Scala-only feature. :(

Patrick,

Are we planning to add this to PySpark?

Nick


On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra wrote:

> It's much simpler: rdd.partitions.size
>
>
> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Hey there fellow Dukes of Data,
>>
>> How can I tell how many partitions my RDD is split into?
>>
>> I'm interested in knowing because, from what I gather, having a good
>> number of partitions is good for performance. If I'm looking to understand
>> how my pipeline is performing, say for a parallelized write out to HDFS,
>> knowing how many partitions an RDD has would be a good thing to check.
>>
>> Is that correct?
>>
>> I could not find an obvious method or property to see how my RDD is
>> partitioned. Instead, I devised the following thingy:
>>
>> def f(idx, itr): yield idx
>>
>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>> rdd.mapPartitionsWithIndex(f).count()
>>
>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>> answer I'm looking for. Derp. :)
>>
>> So in summary, should I care about how finely my RDDs are partitioned?
>> And how would I check on that?
>>
>> Nick
>>
>>
>> --
>> View this message in context: How many partitions is my RDD split 
>> into?
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Hi,

I'm running my program on a single large-memory, many-core machine (64 cores,
1TB RAM). But to avoid having huge JVMs, I want to use several processes /
worker instances - each using 8 cores (i.e. use SPARK_WORKER_INSTANCES).
When I use 2 worker instances, everything works fine, but when I try using 4
or more worker instances and start the spark-shell, I get the following
exceptions by the workers:

14/03/24 08:18:51 ERROR ActorSystemImpl: Uncaught fatal error from thread
[spark-akka.actor.default-dispatcher-3] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:691)
at
scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
at
scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
at
scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
at 
scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(AbstractDispatcher.scala:374)
at
akka.dispatch.ExecutorServiceDelegate$class.execute(ThreadPoolBuilder.scala:212)
at
akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43)
at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:118)
at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:59)
at akka.actor.dungeon.Dispatch$class.sendMessage(Dispatch.scala:120)
at akka.actor.ActorCell.sendMessage(ActorCell.scala:338)
at akka.actor.Cell$class.sendMessage(ActorCell.scala:259)
at akka.actor.ActorCell.sendMessage(ActorCell.scala:338)
at akka.actor.LocalActorRef.$bang(ActorRef.scala:389)
at akka.actor.Scheduler$$anon$8.run(Scheduler.scala:62)
at
akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

*The config file spark-env.sh contains:*
export JAVA_HOME=/usr/java/jdk1.7.0_09
export PATH=/usr/java/jdk1.7.0_09/bin/:$PATH

export SPARK_JAVA_OPTS="-Dspark.executor.memory=80g
-Dspark.local.dir=/lfs/local/0/yonathan/tmp
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer
-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
-Xms80g -Xmx80g
-XX:-UseGCOverheadLimit -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps"

export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=80g
export SPARK_EXECUTOR_MEMORY=80g
export SPARK_DRIVER_MEMORY=10g
export SPARK_DAEMON_MEMORY=10g
export SPARK_WORKER_INSTANCES=4
export SPARK_DAEMON_JAVA_OPTS="-Xms10g -Xmx10g"

I use *Spark-0.9.0*

I would appreciate any help or advice on the subject.

Thanks!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-starting-worker-processes-in-standalone-mode-tp3102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Oh, I also forgot to mention:

I start the master and workers (call ./sbin/start-all.sh), and then start
the shell:
MASTER=spark://localhost:7077 ./bin/spark-shell

Then I get the exceptions...

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-starting-worker-processes-in-standalone-mode-tp3102p3103.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: pySpark memory usage

2014-03-24 Thread Matei Zaharia
Hey Jim,

In Spark 0.9 we added a “batchSize” parameter to PySpark that makes it group 
multiple objects together before passing them between Java and Python, but this 
may be too high by default. Try passing batchSize=10 to your SparkContext 
constructor to lower it (the default is 1024). Or even batchSize=1 to match 
earlier versions.

Matei

On Mar 21, 2014, at 6:18 PM, Jim Blomo  wrote:

> Hi all, I'm wondering if there's any settings I can use to reduce the
> memory needed by the PythonRDD when computing simple stats.  I am
> getting OutOfMemoryError exceptions while calculating count() on big,
> but not absurd, records.  It seems like PythonRDD is trying to keep
> too many of these records in memory, when all that is needed is to
> stream through them and count.  Any tips for getting through this
> workload?
> 
> 
> Code:
> session = sc.textFile('s3://...json.gz') # ~54GB of compressed data
> 
> # the biggest individual text line is ~3MB
> parsed = session.map(lambda l: l.split("\t",1)).map(lambda (y,s):
> (loads(y), loads(s)))
> parsed.persist(StorageLevel.MEMORY_AND_DISK)
> 
> parsed.count()
> # will never finish: executor.Executor: Uncaught exception will FAIL
> all executors
> 
> Incidentally the whole app appears to be killed, but this error is not
> propagated to the shell.
> 
> Cluster:
> 15 m2.xlarges (17GB memory, 17GB swap, spark.executor.memory=10GB)
> 
> Exception:
> java.lang.OutOfMemoryError: Java heap space
>at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:132)
>at 
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:120)
>at 
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:113)
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>at 
> org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:113)
>at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:94)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>at 
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)



remove duplicates

2014-03-24 Thread Adrian Mocanu
I have a DStream like this:
..RDD[a,b],RDD[b,c]..

Is there a way to remove duplicates across the entire DStream? I.e., I would like 
the output to be (by removing one of the b's):
..RDD[a],RDD[b,c]..  or ..RDD[a,b],RDD[c]..

Thanks
-Adrian



Re: sbt/sbt assembly fails with ssl certificate error

2014-03-24 Thread Debasish Das
After a long time (maybe a month) I could do a fresh build for
2.0.0-mr1-cdh4.5.0...I was using the cached files in .ivy2/cache

My case is especially painful since I have to build behind a firewall...

@Sean thanks for the fix...I think we should put a test for https/firewall
compilation as well...


Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on master, so a large k would slow down
the initialization stage. If your data is sparse, the latest change to
KMeans will help with the speed, depending on how sparse your data is.
-Xiangrui
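
For reference, a minimal sketch along the lines of the MLlib clustering guide
for 0.9 (assuming whitespace-separated numeric features in a text file; the
path, k, and iteration count are placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "KMeansSketch")

    // Each line is assumed to hold space-separated doubles; in Spark 0.9
    // MLlib's KMeans trains on an RDD[Array[Double]].
    val data = sc.textFile("kmeans_data.txt")
    val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()

    val numClusters = 50
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    clusters.clusterCenters.foreach(c => println(c.mkString(",")))
    sc.stop()
  }
}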

On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming  wrote:
> Thanks, Let me try with a smaller K.
>
> Does the size of the input data matters for the example? Currently I have 50M 
> rows. What is a reasonable size to demonstrate the capability of Spark?
>
>
>
>
>
> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng  wrote:
>
>> K = 50 is certainly a large number for k-means. If there is no
>> particular reason to have 50 clusters, could you try to reduce it
>> to, e.g, 100 or 1000? Also, the example code is not for large-scale
>> problems. You should use the KMeans algorithm in mllib clustering for
>> your problem.
>>
>> -Xiangrui
>>
>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming  wrote:
>>> Hi,
>>>
>>> This is on a 4 nodes cluster each with 32 cores/256GB Ram.
>>>
>>> (0.9.0) is deployed in a stand alone mode.
>>>
>>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>>
>>> This is on the first iteration. K=50. Here's the code I use:
>>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng  wrote:
>>>
 Hi Tsai,

 Could you share more information about the machine you used and the
 training parameters (runs, k, and iterations)? It can help solve your
 issues. Thanks!

 Best,
 Xiangrui

 On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  
 wrote:
> Hi,
>
> At the reduceBuyKey stage, it takes a few minutes before the tasks start 
> working.
>
> I have -Dspark.default.parallelism=127 cores (n-1).
>
> CPU/Network/IO is idling across all nodes when this is happening.
>
> And there is nothing particular on the master log file. From the 
> spark-shell:
>
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 
> on executor 2: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 
> 38765155 bytes in 193 ms
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 
> on executor 1: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 
> 38765155 bytes in 96 ms
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 
> on executor 0: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 
> 38765155 bytes in 100 ms
>
> But it stops there for some significant time before any movement.
>
> In the stage detail of the UI, I can see that there are 127 tasks running 
> but the duration each is at least a few minutes.
>
> I'm working off local storage (not hdfs) and the kmeans data is about 
> 6.5GB (50M rows).
>
> Is this a normal behaviour?
>
> Thanks!
>>>
>


distinct in data frame in spark

2014-03-24 Thread Chengi Liu
Hi,
  I have a very simple use case:

I have an RDD like the following:

d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]

Now, I want to remove all the duplicates based on a column and return the
remaining frame.
For example:
If I want to remove the duplicates based on column 1,
then basically I would remove either row 1 or row 2 in my final result,
because column 1 of both the first and second rows is the same element (1),
and hence a duplicate.
So, a possible result is:

output = [[1,2,3,4],[2,3,4,5]]

How do I do this in spark?
Thanks
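
One possible approach (a sketch in Scala, not from this thread): key each row
by the de-duplication column and keep one row per key with reduceByKey:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD functions such as reduceByKey

object DedupByColumnSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "DedupByColumnSketch")
    val d = sc.parallelize(Seq(Seq(1, 2, 3, 4), Seq(1, 5, 2, 3), Seq(2, 3, 4, 5)))

    val output = d.map(row => (row.head, row))  // key each row by column 1
      .reduceByKey((a, b) => a)                 // arbitrarily keep one row per key
      .values

    output.collect().foreach(println)  // e.g. List(1, 2, 3, 4) and List(2, 3, 4, 5)
    sc.stop()
  }
}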


Re: distinct on huge dataset

2014-03-24 Thread Aaron Davidson
Look up setting ulimit, though note the distinction between soft and hard
limits, and that updating your hard limit may require
changing /etc/security/limits.conf and restarting each worker.


On Mon, Mar 24, 2014 at 1:39 AM, Kane  wrote:

> Got a bit further, i think out of memory error was caused by setting
> spark.spill to false. Now i have this error, is there an easy way to
> increase file limit for spark, cluster-wide?:
>
> java.io.FileNotFoundException:
>
> /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2
> (Too many open files)
> at java.io.FileOutputStream.openAppend(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:192)
> at
>
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
> at
>
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
> at
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
> at
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
> at
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
> at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
> at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at
>
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
1. Not sure on this, I don't believe we change the defaults from Java.

2. SPARK_JAVA_OPTS can be used to set the various Java properties (other
than memory heap size itself)

3. If you want to have 8 GB executors then, yes, only two can run on each
16 GB node. (In fact, you should also keep a significant amount of memory
free for the OS to use for buffer caching and such.)
An executor may use many cores, though, so this shouldn't be an issue.


On Mon, Mar 24, 2014 at 2:44 AM, Sai Prasanna wrote:

> Thanks Aaron and Sean...
>
> Setting SPARK_MEM finally worked. But i have a small doubt.
> 1)What is the default value that is allocated for JVM and for HEAP_SPACE
> for Garbage collector.
>
> 2)Usually we set 1/3 of total memory for heap. So what should be the
> practice for Spark processes. Where & how should we set them.
> And what is the default value does it assume?
>
> 3) Moreover, if we set SPARK_MEM to say 8g and i have a 16g RAM, can only
> two executors run max on a node of a cluster ??
>
>
> Thanks Again !!
>
>
>
>
> On Mon, Mar 24, 2014 at 2:13 PM, Sean Owen  wrote:
>
>> PS you have a typo in "DEAMON" - its DAEMON. Thanks Latin.
>> On Mar 24, 2014 7:25 AM, "Sai Prasanna"  wrote:
>>
>>> Hi All !! I am getting the following error in interactive spark-shell
>>> [0.8.1]
>>>
>>>
>>>  *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
>>> than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
>>> exceeded*
>>>
>>>
>>> But i had set the following in the spark.env.sh and hadoop-env.sh
>>>
>>> export SPARK_DEAMON_MEMORY=8g
>>> export SPARK_WORKER_MEMORY=8g
>>> export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
>>> export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"
>>>
>>>
>>> export HADOOP_HEAPSIZE=4000
>>>
>>> Any suggestions ??
>>>
>>> --
>>> *Sai Prasanna. AN*
>>> *II M.Tech (CS), SSSIHL*
>>>
>>>
>>>
>
>
> --
> *Sai Prasanna. AN*
> *II M.Tech (CS), SSSIHL*
>
>
> *Entire water in the ocean can never sink a ship, Unless it gets inside.
> All the pressures of life can never hurt you, Unless you let them in.*
>


Comparing GraphX and GraphLab

2014-03-24 Thread Niko Stahl
Hello,

I'm interested in extending the comparison between GraphX and GraphLab
presented in Xin et al. (2013). The evaluation presented there is rather
limited as it only compares the frameworks for one algorithm (PageRank) on
a cluster with a fixed number of nodes. Are there any graph algorithms
where one might expect GraphX to perform better than GraphLab? Do you
expect the scaling properties (i.e. performance as a function of # of
worker nodes) to differ?

Thanks for your help. I'd be happy to share the results of this study with
the community, if there's sufficient interest.

Regards,
Niko


Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Thanks Aaron !!


On Mon, Mar 24, 2014 at 10:58 PM, Aaron Davidson  wrote:

> 1. Not sure on this, I don't believe we change the defaults from Java.
>
> 2. SPARK_JAVA_OPTS can be used to set the various Java properties (other
> than memory heap size itself)
>
> 3. If you want to have 8 GB executors then, yes, only two can run on each
> 16 GB node. (In fact, you should also keep a significant amount of memory
> free for the OS to use for buffer caching and such.)
> An executor may use many cores, though, so this shouldn't be an issue.
>
>
> On Mon, Mar 24, 2014 at 2:44 AM, Sai Prasanna wrote:
>
>> Thanks Aaron and Sean...
>>
>> Setting SPARK_MEM finally worked. But i have a small doubt.
>> 1)What is the default value that is allocated for JVM and for HEAP_SPACE
>> for Garbage collector.
>>
>> 2)Usually we set 1/3 of total memory for heap. So what should be the
>> practice for Spark processes. Where & how should we set them.
>> And what is the default value does it assume?
>>
>> 3) Moreover, if we set SPARK_MEM to say 8g and i have a 16g RAM, can only
>> two executors run max on a node of a cluster ??
>>
>>
>> Thanks Again !!
>>
>>
>>
>>
>> On Mon, Mar 24, 2014 at 2:13 PM, Sean Owen  wrote:
>>
>>> PS you have a typo in "DEAMON" - it's DAEMON. Thanks, Latin.
>>> On Mar 24, 2014 7:25 AM, "Sai Prasanna"  wrote:
>>>
 Hi All !! I am getting the following error in interactive spark-shell
 [0.8.1]


  *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
 than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
 exceeded*


 But i had set the following in the spark.env.sh and hadoop-env.sh

 export SPARK_DEAMON_MEMORY=8g
 export SPARK_WORKER_MEMORY=8g
 export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
 export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"


 export HADOOP_HEAPSIZE=4000

 Any suggestions ??

 --
 *Sai Prasanna. AN*
 *II M.Tech (CS), SSSIHL*



>>
>>
>> --
>> *Sai Prasanna. AN*
>> *II M.Tech (CS), SSSIHL*
>>
>>
>> *Entire water in the ocean can never sink a ship, Unless it gets inside.
>> All the pressures of life can never hurt you, Unless you let them in.*
>>
>
>


-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside.All
the pressures of life can never hurt you, Unless you let them in.*


Re: Comparing GraphX and GraphLab

2014-03-24 Thread Debasish Das
Niko,

Comparing some other components will be very useful as well... svd++ from
graphx vs the same algorithm in graphlab... also mllib.recommendation.als
implicit/explicit compared to the collaborative filtering toolkit in
graphlab...

To stress test, what's the biggest sparse dataset that you have in mind?
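
For concreteness, the MLlib side of such a comparison could be driven by
something like this (a rough sketch against the Spark 0.9 mllib.recommendation
API; the input path and parameters are made up for illustration):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse a hypothetical "user,product,rating" CSV into MLlib Rating objects.
val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Explicit-feedback ALS: rank, iterations, lambda.
val explicitModel = ALS.train(ratings, 10, 20, 0.01)
// Implicit-feedback ALS: same parameters plus the confidence weight alpha.
val implicitModel = ALS.trainImplicit(ratings, 10, 20, 0.01, 1.0)
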
 On Mar 24, 2014 11:00 AM, "Niko Stahl"  wrote:

> Hello,
>
> I'm interested in extending the comparison between GraphX and GraphLab
> presented in Xin et. al (2013). The evaluation presented there is rather
> limited as it only compares the frameworks for one algorithm (PageRank) on
> a cluster with a fixed number of nodes. Are there any graph algorithms
> where one might expect GraphX to perform better than GraphLab? Do you
> expect the scaling properties (i.e. performance as a function of # of
> worker nodes) to differ?
>
> Thanks for your help. I'd be happy to share the results of this study with
> the community, if there's sufficient interest.
>
> Regards,
> Niko
>


Re: Transitive dependency incompatibility

2014-03-24 Thread Jaka Jančar
Just found this issue and wanted to link it here, in case somebody finds this 
thread later:  

https://spark-project.atlassian.net/browse/SPARK-939



On Thursday, March 20, 2014 at 11:14 AM, Matei Zaharia wrote:

> Hi Jaka,
>  
> I’d recommend rebuilding Spark with a new version of the HTTPClient 
> dependency. In the future we want to add a “put the user’s classpath first” 
> option to let users overwrite dependencies.
>  
> Matei
>  
> On Mar 20, 2014, at 8:42 AM, Jaka Jančar <j...@kubje.org> wrote:
> > Could the executor use isolated classloaders, in order not to pollute the 
> > environment with its own stuff?
> >  
> >  
> >  
> >  
> >  
> > On Wednesday, March 19, 2014 at 8:30 AM, Jaka Jančar wrote:
> >  
> > > Hi,
> > >  
> > > I'm getting the following error:
> > >  
> > > java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
> > > at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
> > > at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
> > > at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
> > > at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:85)
> > > at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:93)
> > > at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
> > > at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
> > > at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:155)
> > > at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
> > > at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
> > > at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:334)
> > > at com.celtra.analyzer.TrackingLogRDD.createClient(TrackingLogRDD.scala:131)
> > > at com.celtra.analyzer.TrackingLogRDD.compute(TrackingLogRDD.scala:117)
> > > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
> > > at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
> > > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
> > > at org.apache.spark.scheduler.Task.run(Task.scala:53)
> > > at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
> > > at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
> > > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > at java.lang.Thread.run(Thread.java:744)
> > >  
> > > My app uses the AWS SDK, which requires 
> > > org.apache.httpcomponents:httpclient:4.2.
> > >  
> > > I believe the error is caused by the fact that an older version of the 
> > > package is already present on the classpath:
> > >  
> > > Spark -> Akka -> sjson -> org.apache.httpcomponents:httpclient:4.1
> > > Spark -> jets3t -> commons-httpclient:commons-httpclient:3.1
> > >  
> > >  
> > > What are my options if I need to use a newer version of the library in my 
> > > app?
> > >  
> > > Thanks,
> > > Jaka
> > >  
> > >  
> > >  
> > >  
> > >  
> >  
>  



quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home page to run a "standalone" Scala application?  I can't,
and I figure I must be missing something obvious!

I'm trying to follow the instructions here as close to "word for word" as
possible:
http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala

1.  The instructions don't say what directory to create my test application
in, but later I'm instructed to run "sbt/sbt" so I conclude that my working
directory must be $SPARK_HOME.  (Temporarily ignoring that it is a little
weird to be working directly in the Spark distro.)

2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
 Copy&paste in the code from the instructions exactly, replacing
YOUR_SPARK_HOME with my spark home path.

3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt file
from the instructions

4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the
ENTIRE Spark project!  This takes several minutes, and at the end, it says
"Done packaging".  unfortunately, there's nothing in the
$SPARK_HOME/mysparktest/ folder other than what I already had there.

(Just for fun, I also did what I thought was more logical, which is to set my
working directory to $SPARK_HOME/mysparktest and run $SPARK_HOME/sbt/sbt
package, but that was even less successful: I got an error:
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
reading (No such file or directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
directory
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
directory
Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
install sbt manually from http://www.scala-sbt.org/


So, help?  I'm sure these instructions work because people are following
them every day, but I can't tell what they are supposed to do.

Thanks!
Diana


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana,   

See my inlined answer  

--  
Nan Zhu



On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:

> Has anyone successfully followed the instructions on the Quick Start page of 
> the Spark home page to run a "standalone" Scala application?  I can't, and I 
> figure I must be missing something obvious!
>  
> I'm trying to follow the instructions here as close to "word for word" as 
> possible:
> http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>  
> 1.  The instructions don't say what directory to create my test application 
> in, but later I'm instructed to run "sbt/sbt" so I conclude that my working 
> directory must be $SPARK_HOME.  (Temporarily ignoring that it is a little 
> weird to be working directly in the Spark distro.)

You can create your application in any directory, just follow the sbt project 
dir structure
>  
> 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.  
> Copy&paste in the code from the instructions exactly, replacing 
> YOUR_SPARK_HOME with my spark home path.

should be correct  
>  
> 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt file 
> from the instructions

should be correct   
>  
> 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the ENTIRE 
> Spark project!  This takes several minutes, and at the end, it says "Done 
> packaging".  unfortunately, there's nothing in the $SPARK_HOME/mysparktest/ 
> folder other than what I already had there.   

because you are in Spark directory, don’t need to do that actually , the 
dependency on Spark is resolved by sbt
  
>  
> (Just for fun, I also did what I thought was more logical, which is set my 
> working directory to $SPARK_HOME/mysparktest, and but $SPARK_HOME/sbt/sbt 
> package, but that was even less successful: I got an error:  
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for 
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please 
> install sbt manually from http://www.scala-sbt.org/
>  
>  
>  
> So, help?  I'm sure these instructions work because people are following them 
> every day, but I can't tell what they are supposed to do.   
>  
> Thanks!  
> Diana
>  



Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You don't need to put your app
under SPARK_HOME. I would create it in its own folder somewhere, it
follows the rules of any standalone scala program (including the
layout). In the guide, $SPARK_HOME is only relevant to find the Readme
file which they are parsing/word-counting. But otherwise the compile
time dependencies on spark would be resolved via the sbt file (or the
pom file if you look at the Java example).

So for example I put my app under C:\Source\spark-code and the jar
gets created in C:\Source\spark-code\target\scala-2.9.3 (or 2.10 if
you're running with scala 2.10 as the example shows). But for that
part of the guide, it's not any different than building a scala app.

On Mon, Mar 24, 2014 at 3:44 PM, Diana Carroll  wrote:
> Has anyone successfully followed the instructions on the Quick Start page of
> the Spark home page to run a "standalone" Scala application?  I can't, and I
> figure I must be missing something obvious!
>
> I'm trying to follow the instructions here as close to "word for word" as
> possible:
> http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>
> 1.  The instructions don't say what directory to create my test application
> in, but later I'm instructed to run "sbt/sbt" so I conclude that my working
> directory must be $SPARK_HOME.  (Temporarily ignoring that it is a little
> weird to be working directly in the Spark distro.)
>
> 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
> Copy&paste in the code from the instructions exactly, replacing
> YOUR_SPARK_HOME with my spark home path.
>
> 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt file
> from the instructions
>
> 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the
> ENTIRE Spark project!  This takes several minutes, and at the end, it says
> "Done packaging".  unfortunately, there's nothing in the
> $SPARK_HOME/mysparktest/ folder other than what I already had there.
>
> (Just for fun, I also did what I thought was more logical, which is set my
> working directory to $SPARK_HOME/mysparktest, and but $SPARK_HOME/sbt/sbt
> package, but that was even less successful: I got an error:
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> install sbt manually from http://www.scala-sbt.org/
>
>
> So, help?  I'm sure these instructions work because people are following
> them every day, but I can't tell what they are supposed to do.
>
> Thanks!
> Diana
>


Re: Comparing GraphX and GraphLab

2014-03-24 Thread Ankur Dave
Hi Niko,

The GraphX team recently wrote a longer paper with more benchmarks and
optimizations: http://arxiv.org/abs/1402.2394

Regarding the performance of GraphX vs. GraphLab, I believe GraphX
currently outperforms GraphLab only in end-to-end benchmarks of pipelines
involving both graph-parallel operations (e.g. PageRank) and data-parallel
operations (e.g. ETL and data cleaning). This is due to the overhead of
moving data between GraphLab and a data-parallel system like Spark. There's
an example of a pipeline in Section 5.2 in the linked paper, and the
results are in Figure 10 on page 11.

GraphX has a very similar architecture as GraphLab, so I wouldn't expect it
to have better performance on pure graph algorithms. GraphX may actually be
slower when Spark is configured to launch many tasks per machine, because
shuffle communication between Spark tasks on the same machine still occurs
by reading and writing from disk, while GraphLab uses shared memory for
same-machine communication.

I've CC'd Joey and Reynold as well.

Ankur 

On Mar 24, 2014 11:00 AM, "Niko Stahl"  wrote:

> I'm interested in extending the comparison between GraphX and GraphLab
> presented in Xin et. al (2013). The evaluation presented there is rather
> limited as it only compares the frameworks for one algorithm (PageRank) on
> a cluster with a fixed number of nodes. Are there any graph algorithms
> where one might expect GraphX to perform better than GraphLab? Do you
> expect the scaling properties (i.e. performance as a function of # of
> worker nodes) to differ?
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Yana: Thanks.  Can you give me a transcript of the actual commands you are
running?

Thanks!
Diana


On Mon, Mar 24, 2014 at 3:59 PM, Yana Kadiyska wrote:

> I am able to run standalone apps. I think you are making one mistake
> that throws you off from there onwards. You don't need to put your app
> under SPARK_HOME. I would create it in its own folder somewhere, it
> follows the rules of any standalone scala program (including the
> layout). In the guide, $SPARK_HOME is only relevant to find the Readme
> file which they are parsing/word-counting. But otherwise the compile
> time dependencies on spark would be resolved via the sbt file (or the
> pom file if you look at the Java example).
>
> So for example I put my app under C:\Source\spark-code and the jar
> gets created in C:\Source\spark-code\target\scala-2.9.3 (or 2.10 if
> you're running with scala 2.10 as the example shows). But for that
> part of the guide, it's not any different than building a scala app.
>
> On Mon, Mar 24, 2014 at 3:44 PM, Diana Carroll 
> wrote:
> > Has anyone successfully followed the instructions on the Quick Start
> page of
> > the Spark home page to run a "standalone" Scala application?  I can't,
> and I
> > figure I must be missing something obvious!
> >
> > I'm trying to follow the instructions here as close to "word for word" as
> > possible:
> >
> http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
> >
> > 1.  The instructions don't say what directory to create my test
> application
> > in, but later I'm instructed to run "sbt/sbt" so I conclude that my
> working
> > directory must be $SPARK_HOME.  (Temporarily ignoring that it is a little
> > weird to be working directly in the Spark distro.)
> >
> > 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
> > Copy&paste in the code from the instructions exactly, replacing
> > YOUR_SPARK_HOME with my spark home path.
> >
> > 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt
> file
> > from the instructions
> >
> > 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the
> > ENTIRE Spark project!  This takes several minutes, and at the end, it
> says
> > "Done packaging".  unfortunately, there's nothing in the
> > $SPARK_HOME/mysparktest/ folder other than what I already had there.
> >
> > (Just for fun, I also did what I thought was more logical, which is set
> my
> > working directory to $SPARK_HOME/mysparktest, and but $SPARK_HOME/sbt/sbt
> > package, but that was even less successful: I got an error:
> > awk: cmd. line:1: fatal: cannot open file `./project/build.properties'
> for
> > reading (No such file or directory)
> > Attempting to fetch sbt
> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > directory
> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > directory
> > Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> > install sbt manually from http://www.scala-sbt.org/
> >
> >
> > So, help?  I'm sure these instructions work because people are following
> > them every day, but I can't tell what they are supposed to do.
> >
> > Thanks!
> > Diana
> >
>


[bug?] streaming window unexpected behaviour

2014-03-24 Thread Adrian Mocanu
I have what I would call unexpected behaviour when using window on a stream.
I have 2 windowed streams with a 5s batch interval. One windowed stream is 
(5s,5s)=smallWindow and the other is (10s,5s)=bigWindow.
What I've noticed is that the 1st RDD produced by bigWindow is incorrect: it 
covers only 5s of data, not 10s. So instead of waiting 10s and producing one RDD 
spanning 10s, Spark produced the first "10s" RDD from just 5s of data.
Why is this happening? To me it looks like a bug; Matei or TD, can you confirm 
whether this is the intended behaviour?


I have the following code (the step that reshapes the queue streams into keyed
3-tuple values is omitted here for brevity):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(5))

val smallWindowStream = ssc.queueStream(smallWindowRddQueue)
val bigWindowStream = ssc.queueStream(bigWindowRddQueue)

// smallWindowReshapedStream / bigWindowReshapedStream are the two streams above
// mapped to (key, (v1, v2, v3)) pairs; that mapping is not shown here
val smallWindow = smallWindowReshapedStream.window(Seconds(5), Seconds(5))
  .reduceByKey((t1, t2) => (t1._1, t1._2, t1._3 + t2._3))
val bigWindow = bigWindowReshapedStream.window(Seconds(10), Seconds(5))
  .reduceByKey((t1, t2) => (t1._1, t1._2, t1._3 + t2._3))
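
A quick way to see what each windowed RDD actually contains is to count its
records per batch (a sketch, assuming the two streams above):

// Prints one line per batch interval on the driver, so the 5s-vs-10s
// discrepancy in the first bigWindow RDD shows up directly.
smallWindow.foreachRDD(rdd => println("smallWindow record count: " + rdd.count()))
bigWindow.foreachRDD(rdd => println("bigWindow record count: " + rdd.count()))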

-Adrian



Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski

Diana,

Anywhere on the filesystem you have read/write access (you need not be 
in your spark home directory):


mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .

Make sure that myproject.sbt has the following in it:

name := "I NEED A NAME!"

version := "I NEED A VERSION!"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % 
"0.9.0-incubating"


If you will be using Hadoop/HDFS functionality you will need the below 
line also


libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"

The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are 
using 0.8.1 - adjust appropriately.


That's it. Now you can do

sbt compile
sbt run 

You can also do
sbt package to produce a jar file of your code which you can then add to 
the SparkContext at runtime.
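
For example, adding the packaged jar when constructing the context might look
like this (a sketch; the jar name, Spark home, and master URL are placeholders):

val sc = new org.apache.spark.SparkContext(
  "spark://master:7077",                              // placeholder master URL
  "MyProject",
  "/path/to/spark",                                   // sparkHome on the workers (placeholder)
  Seq("target/scala-2.10/myproject_2.10-1.0.jar"))    // the jar produced by sbt package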


In a more complicated project you may need to have a bit more involved 
hierarchy like com.github.dianacarroll which will then translate to 
src/main/scala/com/github/dianacarroll/ where you can put your multiple 
.scala files which will then have to be a part of a package 
com.github.dianacarroll (you can just put that as your first line in 
each of these scala files). I am new to Java/Scala so this is how I do 
it. More educated Java/Scala programmers may tell you otherwise ;)


You can get more complicated with the sbt project subdirectory but you 
can read independently about sbt and what it can do, above is the bare 
minimum.
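
For completeness, the source file you copy into src/main/scala/ can simply be
the SimpleApp example from the quick start guide, roughly (a sketch against the
0.9 API; replace the file path, and the "local" master, as needed):

import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    // Any text file you have lying around works; this path is a placeholder.
    val logFile = "/path/to/spark/README.md"
    val sc = new SparkContext("local", "Simple App")
    val logData = sc.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}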


Let me know if that helped.
Ognen

On 3/24/14, 2:44 PM, Diana Carroll wrote:
Has anyone successfully followed the instructions on the Quick Start 
page of the Spark home page to run a "standalone" Scala application? 
 I can't, and I figure I must be missing something obvious!


I'm trying to follow the instructions here as close to "word for word" 
as possible:

http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala

1.  The instructions don't say what directory to create my test 
application in, but later I'm instructed to run "sbt/sbt" so I 
conclude that my working directory must be $SPARK_HOME.  (Temporarily 
ignoring that it is a little weird to be working directly in the Spark 
distro.)


2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala. 
 Copy&paste in the code from the instructions exactly, replacing 
YOUR_SPARK_HOME with my spark home path.


3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt 
file from the instructions


4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the 
ENTIRE Spark project!  This takes several minutes, and at the end, it 
says "Done packaging".  unfortunately, there's nothing in the 
$SPARK_HOME/mysparktest/ folder other than what I already had there.


(Just for fun, I also did what I thought was more logical, which is 
set my working directory to $SPARK_HOME/mysparktest, and but 
$SPARK_HOME/sbt/sbt package, but that was even less successful: I got 
an error:
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' 
for reading (No such file or directory)

Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
directory
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
directory
Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. 
Please install sbt manually from http://www.scala-sbt.org/



So, help?  I'm sure these instructions work because people are 
following them every day, but I can't tell what they are supposed to do.


Thanks!
Diana




Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks, Nan Zhu.

You say that my problems are "because you are in Spark directory, don't
need to do that actually , the dependency on Spark is resolved by sbt"

I did try it initially in what I thought was a much more typical place,
e.g. ~/mywork/sparktest1.  But as I said in my email:

(Just for fun, I also did what I thought was more logical, which is set my
working directory to $SPARK_HOME/mysparktest, and but $SPARK_HOME/sbt/sbt
package, but that was even less successful: I got an error:
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
reading (No such file or directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
directory
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
directory
Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
install sbt manually from http://www.scala-sbt.org/



On Mon, Mar 24, 2014 at 4:00 PM, Nan Zhu  wrote:

>  Hi, Diana,
>
> See my inlined answer
>
> --
> Nan Zhu
>
>
> On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:
>
> Has anyone successfully followed the instructions on the Quick Start page
> of the Spark home page to run a "standalone" Scala application?  I can't,
> and I figure I must be missing something obvious!
>
> I'm trying to follow the instructions here as close to "word for word" as
> possible:
>
> http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>
> 1.  The instructions don't say what directory to create my test
> application in, but later I'm instructed to run "sbt/sbt" so I conclude
> that my working directory must be $SPARK_HOME.  (Temporarily ignoring that
> it is a little weird to be working directly in the Spark distro.)
>
>
> You can create your application in any directory, just follow the sbt
> project dir structure
>
>
> 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
>  Copy&paste in the code from the instructions exactly, replacing
> YOUR_SPARK_HOME with my spark home path.
>
>
> should be correct
>
>
> 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt file
> from the instructions
>
>
> should be correct
>
>
> 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the
> ENTIRE Spark project!  This takes several minutes, and at the end, it says
> "Done packaging".  unfortunately, there's nothing in the
> $SPARK_HOME/mysparktest/ folder other than what I already had there.
>
>
> because you are in Spark directory, don't need to do that actually , the
> dependency on Spark is resolved by sbt
>
>
>
> (Just for fun, I also did what I thought was more logical, which is set my
> working directory to $SPARK_HOME/mysparktest, and but $SPARK_HOME/sbt/sbt
> package, but that was even less successful: I got an error:
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> install sbt manually from http://www.scala-sbt.org/
>
>
> So, help?  I'm sure these instructions work because people are following
> them every day, but I can't tell what they are supposed to do.
>
> Thanks!
> Diana
>
>
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks Ongen.

Unfortunately I'm not able to follow your instructions either.  In
particular:

>
> sbt compile
> sbt run 


This doesn't work for me because there's no program on my path called
"sbt".  The instructions in the Quick Start guide are specific that I
should call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my
system called "sbt".

Did you download and install sbt separately?  In following the Quick Start
guide, that was not stated as a requirement, and I'm trying to run through
the guide word for word.

Diana


On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski <
og...@plainvanillagames.com> wrote:

> Diana,
>
> Anywhere on the filesystem you have read/write access (you need not be in
> your spark home directory):
>
> mkdir myproject
> cd myproject
> mkdir project
> mkdir target
> mkdir -p src/main/scala
> cp $mypath/$mymysource.scala src/main/scala/
> cp $mypath/myproject.sbt .
>
> Make sure that myproject.sbt has the following in it:
>
> name := "I NEED A NAME!"
>
> version := "I NEED A VERSION!"
>
> scalaVersion := "2.10.3"
>
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
> "0.9.0-incubating"
>
> If you will be using Hadoop/HDFS functionality you will need the below
> line also
>
> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
>
> The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are
> using 0.8.1 - adjust appropriately.
>
> That's it. Now you can do
>
> sbt compile
> sbt run 
>
> You can also do
> sbt package to produce a jar file of your code which you can then add to
> the SparkContext at runtime.
>
> In a more complicated project you may need to have a bit more involved
> hierarchy like com.github.dianacarroll which will then translate to
> src/main/scala/com/github/dianacarroll/ where you can put your multiple
> .scala files which will then have to be a part of a package
> com.github.dianacarroll (you can just put that as your first line in each
> of these scala files). I am new to Java/Scala so this is how I do it. More
> educated Java/Scala programmers may tell you otherwise ;)
>
> You can get more complicated with the sbt project subrirectory but you can
> read independently about sbt and what it can do, above is the bare minimum.
>
> Let me know if that helped.
> Ognen
>
>
> On 3/24/14, 2:44 PM, Diana Carroll wrote:
>
>> Has anyone successfully followed the instructions on the Quick Start page
>> of the Spark home page to run a "standalone" Scala application?  I can't,
>> and I figure I must be missing something obvious!
>>
>> I'm trying to follow the instructions here as close to "word for word" as
>> possible:
>> http://spark.apache.org/docs/latest/quick-start.html#a-
>> standalone-app-in-scala
>>
>> 1.  The instructions don't say what directory to create my test
>> application in, but later I'm instructed to run "sbt/sbt" so I conclude
>> that my working directory must be $SPARK_HOME.  (Temporarily ignoring that
>> it is a little weird to be working directly in the Spark distro.)
>>
>> 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
>>  Copy&paste in the code from the instructions exactly, replacing
>> YOUR_SPARK_HOME with my spark home path.
>>
>> 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt
>> file from the instructions
>>
>> 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the
>> ENTIRE Spark project!  This takes several minutes, and at the end, it says
>> "Done packaging".  unfortunately, there's nothing in the
>> $SPARK_HOME/mysparktest/ folder other than what I already had there.
>>
>> (Just for fun, I also did what I thought was more logical, which is set
>> my working directory to $SPARK_HOME/mysparktest, and but
>> $SPARK_HOME/sbt/sbt package, but that was even less successful: I got an
>> error:
>> awk: cmd. line:1: fatal: cannot open file `./project/build.properties'
>> for reading (No such file or directory)
>> Attempting to fetch sbt
>> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
>> directory
>> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
>> directory
>> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
>> install sbt manually from http://www.scala-sbt.org/
>>
>>
>> So, help?  I'm sure these instructions work because people are following
>> them every day, but I can't tell what they are supposed to do.
>>
>> Thanks!
>> Diana
>>
>
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Bharath Bhushan
Creating simple.sbt and src/ in $SPARK_HOME allows me to run a standalone scala 
program in the downloaded spark code tree. For example my directory layout is:
$ ls spark-0.9.0-incubating-bin-hadoop2
…
simple.sbt
src
…

$ tree src
src
`-- main
    `-- scala
        `-- SimpleApp.scala

— Bharath

On 25-Mar-2014, at 1:46 am, Diana Carroll  wrote:

> Thanks Ongen.
> 
> Unfortunately I'm not able to follow your instructions either.  In particular:
> 
> sbt compile
> sbt run 
> 
> This doesn't work for me because there's no program on my path called "sbt".  
> The instructions in the Quick Start guide are specific that I should call 
> "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my system called 
> "sbt".
> 
> Did you download and install sbt separately?  In following the Quick Start 
> guide, that was not stated as a requirement, and I'm trying to run through 
> the guide word for word.
> 
> Diana
> 
> 
> On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski 
>  wrote:
> Diana,
> 
> Anywhere on the filesystem you have read/write access (you need not be in 
> your spark home directory):
> 
> mkdir myproject
> cd myproject
> mkdir project
> mkdir target
> mkdir -p src/main/scala
> cp $mypath/$mymysource.scala src/main/scala/
> cp $mypath/myproject.sbt .
> 
> Make sure that myproject.sbt has the following in it:
> 
> name := "I NEED A NAME!"
> 
> version := "I NEED A VERSION!"
> 
> scalaVersion := "2.10.3"
> 
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % 
> "0.9.0-incubating"
> 
> If you will be using Hadoop/HDFS functionality you will need the below line 
> also
> 
> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
> 
> The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are using 
> 0.8.1 - adjust appropriately.
> 
> That's it. Now you can do
> 
> sbt compile
> sbt run 
> 
> You can also do
> sbt package to produce a jar file of your code which you can then add to the 
> SparkContext at runtime.
> 
> In a more complicated project you may need to have a bit more involved 
> hierarchy like com.github.dianacarroll which will then translate to 
> src/main/scala/com/github/dianacarroll/ where you can put your multiple 
> .scala files which will then have to be a part of a package 
> com.github.dianacarroll (you can just put that as your first line in each of 
> these scala files). I am new to Java/Scala so this is how I do it. More 
> educated Java/Scala programmers may tell you otherwise ;)
> 
> You can get more complicated with the sbt project subrirectory but you can 
> read independently about sbt and what it can do, above is the bare minimum.
> 
> Let me know if that helped.
> Ognen
> 
> 
> On 3/24/14, 2:44 PM, Diana Carroll wrote:
> Has anyone successfully followed the instructions on the Quick Start page of 
> the Spark home page to run a "standalone" Scala application?  I can't, and I 
> figure I must be missing something obvious!
> 
> I'm trying to follow the instructions here as close to "word for word" as 
> possible:
> http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
> 
> 1.  The instructions don't say what directory to create my test application 
> in, but later I'm instructed to run "sbt/sbt" so I conclude that my working 
> directory must be $SPARK_HOME.  (Temporarily ignoring that it is a little 
> weird to be working directly in the Spark distro.)
> 
> 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.  
> Copy&paste in the code from the instructions exactly, replacing 
> YOUR_SPARK_HOME with my spark home path.
> 
> 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt file 
> from the instructions
> 
> 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the ENTIRE 
> Spark project!  This takes several minutes, and at the end, it says "Done 
> packaging".  unfortunately, there's nothing in the $SPARK_HOME/mysparktest/ 
> folder other than what I already had there.
> 
> (Just for fun, I also did what I thought was more logical, which is set my 
> working directory to $SPARK_HOME/mysparktest, and but $SPARK_HOME/sbt/sbt 
> package, but that was even less successful: I got an error:
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for 
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please 
> install sbt manually from http://www.scala-sbt.org/
> 
> 
> So, help?  I'm sure these instructions work because people are following them 
> every day, but I can't tell what they are supposed to do.
> 
> Thanks!
> Diana
> 
> 



Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman
There is no direct way to get this in pyspark, but you can get it from the
underlying java rdd. For example

a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()


On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Mark,
>
> This appears to be a Scala-only feature. :(
>
> Patrick,
>
> Are we planning to add this to PySpark?
>
> Nick
>
>
> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra wrote:
>
>> It's much simpler: rdd.partitions.size
>>
>>
>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Hey there fellow Dukes of Data,
>>>
>>> How can I tell how many partitions my RDD is split into?
>>>
>>> I'm interested in knowing because, from what I gather, having a good
>>> number of partitions is good for performance. If I'm looking to understand
>>> how my pipeline is performing, say for a parallelized write out to HDFS,
>>> knowing how many partitions an RDD has would be a good thing to check.
>>>
>>> Is that correct?
>>>
>>> I could not find an obvious method or property to see how my RDD is
>>> partitioned. Instead, I devised the following thingy:
>>>
>>> def f(idx, itr): yield idx
>>>
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> rdd.mapPartitionsWithIndex(f).count()
>>>
>>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>>> answer I'm looking for. Derp. :)
>>>
>>> So in summary, should I care about how finely my RDDs are partitioned?
>>> And how would I check on that?
>>>
>>> Nick
>>>
>>>
>>> --
>>> View this message in context: How many partitions is my RDD split into?
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
You can use any sbt on your machine, including the one that comes with 
spark. For example, try:


~/path_to_spark/sbt/sbt compile
~/path_to_spark/sbt/sbt run 

Or you can just add that to your PATH by:

export PATH=$PATH:~/path_to_spark/sbt

To make it permanent, you can add it to your ~/.bashrc or 
~/.bash_profile or ??? depending on the system you are using. If you are 
on Windows, sorry, I can't offer any help there ;)


Ognen

On 3/24/14, 3:16 PM, Diana Carroll wrote:

Thanks Ongen.

Unfortunately I'm not able to follow your instructions either.  In 
particular:



sbt compile
sbt run 


This doesn't work for me because there's no program on my path called 
"sbt".  The instructions in the Quick Start guide are specific that I 
should call "$SPARK_HOME/sbt/sbt".  I don't have any other executable 
on my system called "sbt".


Did you download and install sbt separately?  In following the Quick 
Start guide, that was not stated as a requirement, and I'm trying to 
run through the guide word for word.


Diana


On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski <og...@plainvanillagames.com> wrote:


Diana,

Anywhere on the filesystem you have read/write access (you need
not be in your spark home directory):

mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .

Make sure that myproject.sbt has the following in it:

name := "I NEED A NAME!"

version := "I NEED A VERSION!"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
"0.9.0-incubating"

If you will be using Hadoop/HDFS functionality you will need the
below line also

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"

The above assumes you are using Spark 0.9 and Scala 2.10.3. If you
are using 0.8.1 - adjust appropriately.

That's it. Now you can do

sbt compile
sbt run 

You can also do
sbt package to produce a jar file of your code which you can then
add to the SparkContext at runtime.

In a more complicated project you may need to have a bit more
involved hierarchy like com.github.dianacarroll which will then
translate to src/main/scala/com/github/dianacarroll/ where you can
put your multiple .scala files which will then have to be a part
of a package com.github.dianacarroll (you can just put that as
your first line in each of these scala files). I am new to
Java/Scala so this is how I do it. More educated Java/Scala
programmers may tell you otherwise ;)

You can get more complicated with the sbt project subrirectory but
you can read independently about sbt and what it can do, above is
the bare minimum.

Let me know if that helped.
Ognen


On 3/24/14, 2:44 PM, Diana Carroll wrote:

Has anyone successfully followed the instructions on the Quick
Start page of the Spark home page to run a "standalone" Scala
application?  I can't, and I figure I must be missing
something obvious!

I'm trying to follow the instructions here as close to "word
for word" as possible:

http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala

1.  The instructions don't say what directory to create my
test application in, but later I'm instructed to run "sbt/sbt"
so I conclude that my working directory must be $SPARK_HOME.
 (Temporarily ignoring that it is a little weird to be working
directly in the Spark distro.)

2.  Create
$SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
 Copy&paste in the code from the instructions exactly,
replacing YOUR_SPARK_HOME with my spark home path.

3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in
the sbt file from the instructions

4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs
through the ENTIRE Spark project!  This takes several minutes,
and at the end, it says "Done packaging".  unfortunately,
there's nothing in the $SPARK_HOME/mysparktest/ folder other
than what I already had there.

(Just for fun, I also did what I thought was more logical,
which is set my working directory to $SPARK_HOME/mysparktest,
and but $SPARK_HOME/sbt/sbt package, but that was even less
successful: I got an error:
awk: cmd. line:1: fatal: cannot open file
`./project/build.properties' for reading (No such file or
directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such
file or directory
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such
file or directory
Our attempt to download sbt locally to sbt/sbt-launch-.jar
failed. Please install sbt m

question about partitions

2014-03-24 Thread Walrus theCat
Hi,

Quick question about partitions.  If my RDD is partitioned into 5
partitions, does that mean that I am constraining it to exist on at most 5
machines?

Thanks


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Yeah, that's exactly what I did. Unfortunately it doesn't work:

$SPARK_HOME/sbt/sbt package
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
reading (No such file or directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
directory
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
directory
Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
install sbt manually from http://www.scala-sbt.org/



On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski <
og...@plainvanillagames.com> wrote:

>  You can use any sbt on your machine, including the one that comes with
> spark. For example, try:
>
> ~/path_to_spark/sbt/sbt compile
> ~/path_to_spark/sbt/sbt run 
>
> Or you can just add that to your PATH by:
>
> export PATH=$PATH:~/path_to_spark/sbt
>
> To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile
> or ??? depending on the system you are using. If you are on Windows, sorry,
> I can't offer any help there ;)
>
> Ognen
>
>
> On 3/24/14, 3:16 PM, Diana Carroll wrote:
>
> Thanks Ongen.
>
>  Unfortunately I'm not able to follow your instructions either.  In
> particular:
>
>>
>> sbt compile
>> sbt run 
>
>
>  This doesn't work for me because there's no program on my path called
> "sbt".  The instructions in the Quick Start guide are specific that I
> should call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my
> system called "sbt".
>
>  Did you download and install sbt separately?  In following the Quick
> Start guide, that was not stated as a requirement, and I'm trying to run
> through the guide word for word.
>
>  Diana
>
>
>  On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski <
> og...@plainvanillagames.com> wrote:
>
>> Diana,
>>
>> Anywhere on the filesystem you have read/write access (you need not be in
>> your spark home directory):
>>
>> mkdir myproject
>> cd myproject
>> mkdir project
>> mkdir target
>> mkdir -p src/main/scala
>> cp $mypath/$mymysource.scala src/main/scala/
>> cp $mypath/myproject.sbt .
>>
>> Make sure that myproject.sbt has the following in it:
>>
>> name := "I NEED A NAME!"
>>
>> version := "I NEED A VERSION!"
>>
>> scalaVersion := "2.10.3"
>>
>> libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
>> "0.9.0-incubating"
>>
>> If you will be using Hadoop/HDFS functionality you will need the below
>> line also
>>
>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
>>
>> The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are
>> using 0.8.1 - adjust appropriately.
>>
>> That's it. Now you can do
>>
>> sbt compile
>> sbt run 
>>
>> You can also do
>> sbt package to produce a jar file of your code which you can then add to
>> the SparkContext at runtime.
>>
>> In a more complicated project you may need to have a bit more involved
>> hierarchy like com.github.dianacarroll which will then translate to
>> src/main/scala/com/github/dianacarroll/ where you can put your multiple
>> .scala files which will then have to be a part of a package
>> com.github.dianacarroll (you can just put that as your first line in each
>> of these scala files). I am new to Java/Scala so this is how I do it. More
>> educated Java/Scala programmers may tell you otherwise ;)
>>
>> You can get more complicated with the sbt project subrirectory but you
>> can read independently about sbt and what it can do, above is the bare
>> minimum.
>>
>> Let me know if that helped.
>> Ognen
>>
>>
>> On 3/24/14, 2:44 PM, Diana Carroll wrote:
>>
>>> Has anyone successfully followed the instructions on the Quick Start
>>> page of the Spark home page to run a "standalone" Scala application?  I
>>> can't, and I figure I must be missing something obvious!
>>>
>>> I'm trying to follow the instructions here as close to "word for word"
>>> as possible:
>>>
>>> http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>>>
>>> 1.  The instructions don't say what directory to create my test
>>> application in, but later I'm instructed to run "sbt/sbt" so I conclude
>>> that my working directory must be $SPARK_HOME.  (Temporarily ignoring that
>>> it is a little weird to be working directly in the Spark distro.)
>>>
>>> 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.
>>>  Copy&paste in the code from the instructions exactly, replacing
>>> YOUR_SPARK_HOME with my spark home path.
>>>
>>> 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy&paste in the sbt
>>> file from the instructions
>>>
>>> 4.  From the $SPARK_HOME I run "sbt/sbt package".  It runs through the
>>> ENTIRE Spark project!  This takes several minutes, and at the end, it says
>>> "Done packaging".  unfortunately, there's nothing in the
>>> $SPARK_HOME/mysparktest/ folder other than what I already had there.
>>>
>>> (Just for fun, I also did what I thought was more logical, which is set
>>> my working directory to $SPARK_HOME/mysparktest, and but
>>> 

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.

- Patrick

On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
 wrote:
> There is no direct way to get this in pyspark, but you can get it from the
> underlying java rdd. For example
>
> a = sc.parallelize([1,2,3,4], 2)
> a._jrdd.splits().size()
>
>
> On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
>  wrote:
>>
>> Mark,
>>
>> This appears to be a Scala-only feature. :(
>>
>> Patrick,
>>
>> Are we planning to add this to PySpark?
>>
>> Nick
>>
>>
>> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra 
>> wrote:
>>>
>>> It's much simpler: rdd.partitions.size
>>>
>>>
>>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
>>>  wrote:

 Hey there fellow Dukes of Data,

 How can I tell how many partitions my RDD is split into?

 I'm interested in knowing because, from what I gather, having a good
 number of partitions is good for performance. If I'm looking to understand
 how my pipeline is performing, say for a parallelized write out to HDFS,
 knowing how many partitions an RDD has would be a good thing to check.

 Is that correct?

 I could not find an obvious method or property to see how my RDD is
 partitioned. Instead, I devised the following thingy:

 def f(idx, itr): yield idx

 rdd = sc.parallelize([1, 2, 3, 4], 4)
 rdd.mapPartitionsWithIndex(f).count()

 Frankly, I'm not sure what I'm doing here, but this seems to give me the
 answer I'm looking for. Derp. :)

 So in summary, should I care about how finely my RDDs are partitioned?
 And how would I check on that?

 Nick


 
 View this message in context: How many partitions is my RDD split into?
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Soumya Simanta
@Diana - you can set up sbt manually for your project by following the
instructions here.

http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html

Manual Installation

Manual installation requires downloading sbt-launch.jar and creating a script
to start it.

Unix

Put sbt-launch.jar in ~/bin.

Create a script to run the jar, by creating ~/bin/sbt with these contents:

SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"

Make the script executable:

$ chmod u+x ~/bin/sbt



On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll wrote:

> Yeah, that's exactly what I did. Unfortunately it doesn't work:
>
> $SPARK_HOME/sbt/sbt package
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> install sbt manually from http://www.scala-sbt.org/
>
>
>
> On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski <
> og...@plainvanillagames.com> wrote:
>
>>  You can use any sbt on your machine, including the one that comes with
>> spark. For example, try:
>>
>> ~/path_to_spark/sbt/sbt compile
>> ~/path_to_spark/sbt/sbt run 
>>
>> Or you can just add that to your PATH by:
>>
>> export PATH=$PATH:~/path_to_spark/sbt
>>
>> To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile
>> or ??? depending on the system you are using. If you are on Windows, sorry,
>> I can't offer any help there ;)
>>
>> Ognen
>>
>>
>> On 3/24/14, 3:16 PM, Diana Carroll wrote:
>>
>> Thanks Ongen.
>>
>>  Unfortunately I'm not able to follow your instructions either.  In
>> particular:
>>
>>>
>>> sbt compile
>>> sbt run 
>>
>>
>>  This doesn't work for me because there's no program on my path called
>> "sbt".  The instructions in the Quick Start guide are specific that I
>> should call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my
>> system called "sbt".
>>
>>  Did you download and install sbt separately?  In following the Quick
>> Start guide, that was not stated as a requirement, and I'm trying to run
>> through the guide word for word.
>>
>>  Diana
>>
>>
>>  On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski <
>> og...@plainvanillagames.com> wrote:
>>
>>> Diana,
>>>
>>> Anywhere on the filesystem you have read/write access (you need not be
>>> in your spark home directory):
>>>
>>> mkdir myproject
>>> cd myproject
>>> mkdir project
>>> mkdir target
>>> mkdir -p src/main/scala
>>> cp $mypath/$mymysource.scala src/main/scala/
>>> cp $mypath/myproject.sbt .
>>>
>>> Make sure that myproject.sbt has the following in it:
>>>
>>> name := "I NEED A NAME!"
>>>
>>> version := "I NEED A VERSION!"
>>>
>>> scalaVersion := "2.10.3"
>>>
>>> libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
>>> "0.9.0-incubating"
>>>
>>> If you will be using Hadoop/HDFS functionality you will need the below
>>> line also
>>>
>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
>>>
>>> The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are
>>> using 0.8.1 - adjust appropriately.
>>>
>>> That's it. Now you can do
>>>
>>> sbt compile
>>> sbt run 
>>>
>>> You can also do
>>> sbt package to produce a jar file of your code which you can then add to
>>> the SparkContext at runtime.
>>>
>>> In a more complicated project you may need to have a bit more involved
>>> hierarchy like com.github.dianacarroll which will then translate to
>>> src/main/scala/com/github/dianacarroll/ where you can put your multiple
>>> .scala files which will then have to be a part of a package
>>> com.github.dianacarroll (you can just put that as your first line in each
>>> of these scala files). I am new to Java/Scala so this is how I do it. More
>>> educated Java/Scala programmers may tell you otherwise ;)
>>>
>>> You can get more complicated with the sbt project subrirectory but you
>>> can read independently about sbt and what it can do, above is the bare
>>> minimum.
>>>
>>> Let me know if that helped.
>>> Ognen
>>>
>>>
>>> On 3/24/14, 2:44 PM, Diana Carroll wrote:
>>>
 Has anyone successfully followed the instructions on the Quick Start
 page of the Spark home page to run a "standalone" Scala application?  I
 can't, and I figure I must be missing something obvious!

 I'm trying to follow the instructions here as close to "word for word"
 as possible:

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Debasish Das
Hi Ankur,

Given enough memory and proper caching, I don't understand why this is the
case.

GraphX may actually be slower when Spark is configured to launch many tasks
per machine, because shuffle communication between Spark tasks on the same
machine still occurs by reading and writing from disk, while GraphLab uses
shared memory for same-machine communication

Could you please elaborate more on it ?

Thanks.
Deb



On Mon, Mar 24, 2014 at 1:01 PM, Ankur Dave  wrote:

> Hi Niko,
>
> The GraphX team recently wrote a longer paper with more benchmarks and
> optimizations: http://arxiv.org/abs/1402.2394
>
> Regarding the performance of GraphX vs. GraphLab, I believe GraphX
> currently outperforms GraphLab only in end-to-end benchmarks of pipelines
> involving both graph-parallel operations (e.g. PageRank) and data-parallel
> operations (e.g. ETL and data cleaning). This is due to the overhead of
> moving data between GraphLab and a data-parallel system like Spark. There's
> an example of a pipeline in Section 5.2 in the linked paper, and the
> results are in Figure 10 on page 11.
>
> GraphX has a very similar architecture as GraphLab, so I wouldn't expect
> it to have better performance on pure graph algorithms. GraphX may actually
> be slower when Spark is configured to launch many tasks per machine,
> because shuffle communication between Spark tasks on the same machine still
> occurs by reading and writing from disk, while GraphLab uses shared memory
> for same-machine communication.
>
> I've CC'd Joey and Reynold as well.
>
> Ankur 
>
> On Mar 24, 2014 11:00 AM, "Niko Stahl"  wrote:
>
>> I'm interested in extending the comparison between GraphX and GraphLab
>> presented in Xin et. al (2013). The evaluation presented there is rather
>> limited as it only compares the frameworks for one algorithm (PageRank) on
>> a cluster with a fixed number of nodes. Are there any graph algorithms
>> where one might expect GraphX to perform better than GraphLab? Do you
>> expect the scaling properties (i.e. performance as a function of # of
>> worker nodes) to differ?
>>
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
Ah crud, I guess you are right, I am using the sbt I installed manually 
with my Scala installation.


Well, here is what you can do:
mkdir ~/bin
cd ~/bin
wget 
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.1/sbt-launch.jar

vi sbt

Put the following contents into your new file:

SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled 
-XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"

:wq!

chmod u+x sbt

Now you can do ~/bin/sbt compile, ~/bin/sbt package, ~/bin/sbt run, etc.

Ognen

On 3/24/14, 3:30 PM, Diana Carroll wrote:

Yeah, that's exactly what I did. Unfortunately it doesn't work:

$SPARK_HOME/sbt/sbt package
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' 
for reading (No such file or directory)

Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
directory
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
directory
Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. 
Please install sbt manually from http://www.scala-sbt.org/




On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski 
mailto:og...@plainvanillagames.com>> wrote:


You can use any sbt on your machine, including the one that comes
with spark. For example, try:

~/path_to_spark/sbt/sbt compile
~/path_to_spark/sbt/sbt run 

Or you can just add that to your PATH by:

export $PATH=$PATH:~/path_to_spark/sbt

To make it permanent, you can add it to your ~/.bashrc or
~/.bash_profile or ??? depending on the system you are using. If
you are on Windows, sorry, I can't offer any help there ;)

Ognen


On 3/24/14, 3:16 PM, Diana Carroll wrote:

Thanks Ongen.

Unfortunately I'm not able to follow your instructions either.
 In particular:


sbt compile
sbt run 


This doesn't work for me because there's no program on my path
called "sbt".  The instructions in the Quick Start guide are
specific that I should call "$SPARK_HOME/sbt/sbt".  I don't have
any other executable on my system called "sbt".

Did you download and install sbt separately?  In following the
Quick Start guide, that was not stated as a requirement, and I'm
trying to run through the guide word for word.

Diana


On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski
mailto:og...@plainvanillagames.com>> wrote:

Diana,

Anywhere on the filesystem you have read/write access (you
need not be in your spark home directory):

mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .

Make sure that myproject.sbt has the following in it:

name := "I NEED A NAME!"

version := "I NEED A VERSION!"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" % "spark-core_2.10"
% "0.9.0-incubating"

If you will be using Hadoop/HDFS functionality you will need
the below line also

libraryDependencies += "org.apache.hadoop" % "hadoop-client"
% "2.2.0"

The above assumes you are using Spark 0.9 and Scala 2.10.3.
If you are using 0.8.1 - adjust appropriately.

That's it. Now you can do

sbt compile
sbt run 

You can also do
sbt package to produce a jar file of your code which you can
then add to the SparkContext at runtime.

In a more complicated project you may need to have a bit more
involved hierarchy like com.github.dianacarroll which will
then translate to src/main/scala/com/github/dianacarroll/
where you can put your multiple .scala files which will then
have to be a part of a package com.github.dianacarroll (you
can just put that as your first line in each of these scala
files). I am new to Java/Scala so this is how I do it. More
educated Java/Scala programmers may tell you otherwise ;)

You can get more complicated with the sbt project
subrirectory but you can read independently about sbt and
what it can do, above is the bare minimum.

Let me know if that helped.
Ognen


On 3/24/14, 2:44 PM, Diana Carroll wrote:

Has anyone successfully followed the instructions on the
Quick Start page of the Spark home page to run a
"standalone" Scala application?  I can't, and I figure I
must be missing something obvious!

I'm trying to follow the instructions here as close to
"word for word" as possible:

http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala

1.  The instructions don't say what directory to create
my test application in, but later I'm instructed to run
"sbt/sbt" 

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
(since like other folks I had sbt preinstalled on my "usual" machine)

I ran the command exactly as Ognen suggested and see
Set current project to Simple Project (do you see this -- you should
at least be seeing this)
and then a bunch of Resolving ...

messages. I did get an error there, saying it can't find
javax.servlet.orbit. I googled the error and found this thread:

http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E

adding the IvyXML fragment they suggested helped in my case (but
again, the build pretty clearly complained).

If you're still having no luck, I suggest installing sbt and setting
SBT_HOME... http://www.scala-sbt.org/

In either case though, it's not a Spark-specific issue...Hopefully
some of all this helps.

On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll  wrote:
> Yeah, that's exactly what I did. Unfortunately it doesn't work:
>
> $SPARK_HOME/sbt/sbt package
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> install sbt manually from http://www.scala-sbt.org/
>
>
>
> On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
>  wrote:
>>
>> You can use any sbt on your machine, including the one that comes with
>> spark. For example, try:
>>
>> ~/path_to_spark/sbt/sbt compile
>> ~/path_to_spark/sbt/sbt run 
>>
>> Or you can just add that to your PATH by:
>>
>> export $PATH=$PATH:~/path_to_spark/sbt
>>
>> To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile
>> or ??? depending on the system you are using. If you are on Windows, sorry,
>> I can't offer any help there ;)
>>
>> Ognen
>>
>>
>> On 3/24/14, 3:16 PM, Diana Carroll wrote:
>>
>> Thanks Ongen.
>>
>> Unfortunately I'm not able to follow your instructions either.  In
>> particular:
>>>
>>>
>>> sbt compile
>>> sbt run 
>>
>>
>> This doesn't work for me because there's no program on my path called
>> "sbt".  The instructions in the Quick Start guide are specific that I should
>> call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my system
>> called "sbt".
>>
>> Did you download and install sbt separately?  In following the Quick Start
>> guide, that was not stated as a requirement, and I'm trying to run through
>> the guide word for word.
>>
>> Diana
>>
>>
>> On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski
>>  wrote:
>>>
>>> Diana,
>>>
>>> Anywhere on the filesystem you have read/write access (you need not be in
>>> your spark home directory):
>>>
>>> mkdir myproject
>>> cd myproject
>>> mkdir project
>>> mkdir target
>>> mkdir -p src/main/scala
>>> cp $mypath/$mymysource.scala src/main/scala/
>>> cp $mypath/myproject.sbt .
>>>
>>> Make sure that myproject.sbt has the following in it:
>>>
>>> name := "I NEED A NAME!"
>>>
>>> version := "I NEED A VERSION!"
>>>
>>> scalaVersion := "2.10.3"
>>>
>>> libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
>>> "0.9.0-incubating"
>>>
>>> If you will be using Hadoop/HDFS functionality you will need the below
>>> line also
>>>
>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
>>>
>>> The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are
>>> using 0.8.1 - adjust appropriately.
>>>
>>> That's it. Now you can do
>>>
>>> sbt compile
>>> sbt run 
>>>
>>> You can also do
>>> sbt package to produce a jar file of your code which you can then add to
>>> the SparkContext at runtime.
>>>
>>> In a more complicated project you may need to have a bit more involved
>>> hierarchy like com.github.dianacarroll which will then translate to
>>> src/main/scala/com/github/dianacarroll/ where you can put your multiple
>>> .scala files which will then have to be a part of a package
>>> com.github.dianacarroll (you can just put that as your first line in each of
>>> these scala files). I am new to Java/Scala so this is how I do it. More
>>> educated Java/Scala programmers may tell you otherwise ;)
>>>
>>> You can get more complicated with the sbt project subrirectory but you
>>> can read independently about sbt and what it can do, above is the bare
>>> minimum.
>>>
>>> Let me know if that helped.
>>> Ognen
>>>
>>>
>>> On 3/24/14, 2:44 PM, Diana Carroll wrote:

 Has anyone successfully followed the instructions on the Quick Start
 page of the Spark home page to run a "standalone" Scala application?  I
 can't, and I figure I must be missing something obvious!

 I'm trying to follow the instructions here as close to "word for word"
 as possible:

 http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala

>>>

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana,   

You don’t need to use spark-distributed sbt

just download sbt from its official website and set your PATH to the right place

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote:

> Yeah, that's exactly what I did. Unfortunately it doesn't work:
>  
> $SPARK_HOME/sbt/sbt package
> awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for 
> reading (No such file or directory)
> Attempting to fetch sbt
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
> directory
> /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
> directory
> Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please 
> install sbt manually from http://www.scala-sbt.org/
>  
>  
>  
>  
> On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski  (mailto:og...@plainvanillagames.com)> wrote:
> > You can use any sbt on your machine, including the one that comes with 
> > spark. For example, try:
> >  
> > ~/path_to_spark/sbt/sbt compile
> > ~/path_to_spark/sbt/sbt run 
> >  
> > Or you can just add that to your PATH by:
> >  
> > export $PATH=$PATH:~/path_to_spark/sbt
> >  
> > To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile 
> > or ??? depending on the system you are using. If you are on Windows, sorry, 
> > I can't offer any help there ;)
> >  
> > Ognen
> >  
> >  
> > On 3/24/14, 3:16 PM, Diana Carroll wrote:
> > > Thanks Ongen.  
> > >  
> > > Unfortunately I'm not able to follow your instructions either.  In 
> > > particular:
> > > >  
> > > > sbt compile
> > > > sbt run 
> > >  
> > > This doesn't work for me because there's no program on my path called 
> > > "sbt".  The instructions in the Quick Start guide are specific that I 
> > > should call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on 
> > > my system called "sbt".  
> > >  
> > > Did you download and install sbt separately?  In following the Quick 
> > > Start guide, that was not stated as a requirement, and I'm trying to run 
> > > through the guide word for word.  
> > >  
> > > Diana  
> > >  
> > >  
> > > On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski 
> > > mailto:og...@plainvanillagames.com)> wrote:
> > > > Diana,
> > > >  
> > > > Anywhere on the filesystem you have read/write access (you need not be 
> > > > in your spark home directory):
> > > >  
> > > > mkdir myproject
> > > > cd myproject
> > > > mkdir project
> > > > mkdir target
> > > > mkdir -p src/main/scala
> > > > cp $mypath/$mymysource.scala src/main/scala/
> > > > cp $mypath/myproject.sbt .
> > > >  
> > > > Make sure that myproject.sbt has the following in it:
> > > >  
> > > > name := "I NEED A NAME!"
> > > >  
> > > > version := "I NEED A VERSION!"
> > > >  
> > > > scalaVersion := "2.10.3"
> > > >  
> > > > libraryDependencies += "org.apache.spark" % "spark-core_2.10" % 
> > > > "0.9.0-incubating"
> > > >  
> > > > If you will be using Hadoop/HDFS functionality you will need the below 
> > > > line also
> > > >  
> > > > libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
> > > >  
> > > > The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are 
> > > > using 0.8.1 - adjust appropriately.
> > > >  
> > > > That's it. Now you can do
> > > >  
> > > > sbt compile
> > > > sbt run 
> > > >  
> > > > You can also do
> > > > sbt package to produce a jar file of your code which you can then add 
> > > > to the SparkContext at runtime.
> > > >  
> > > > In a more complicated project you may need to have a bit more involved 
> > > > hierarchy like com.github.dianacarroll which will then translate to 
> > > > src/main/scala/com/github/dianacarroll/ where you can put your multiple 
> > > > .scala files which will then have to be a part of a package 
> > > > com.github.dianacarroll (you can just put that as your first line in 
> > > > each of these scala files). I am new to Java/Scala so this is how I do 
> > > > it. More educated Java/Scala programmers may tell you otherwise ;)
> > > >  
> > > > You can get more complicated with the sbt project subrirectory but you 
> > > > can read independently about sbt and what it can do, above is the bare 
> > > > minimum.
> > > >  
> > > > Let me know if that helped.
> > > > Ognen  
> > > >  
> > > >  
> > > > On 3/24/14, 2:44 PM, Diana Carroll wrote:
> > > > > Has anyone successfully followed the instructions on the Quick Start 
> > > > > page of the Spark home page to run a "standalone" Scala application?  
> > > > > I can't, and I figure I must be missing something obvious!
> > > > >  
> > > > > I'm trying to follow the instructions here as close to "word for 
> > > > > word" as possible:
> > > > > http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
> > > > >  
> > > > > 1.  The instructions don't say what directory to create my test 
> > > > > application in, but later I'm instructed to run "sbt/sbt" so I 
> > > > > conclude that my working directory mu

Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt") 
supposed to work? Meaning, can I save files to the HDFS fs this way?


I tried:

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.saveAsTextFile("hdfs://ip:port/path/file.txt")

and it is just hanging. At the same time on my HDFS it created file.txt 
but as a directory which has subdirectories (the final one is empty).


Thanks!
Ognen


Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski

Hmm. Strange. Even the below hangs.

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.count

I then looked at the web UI at port 8080 and realized that the spark 
shell is in WAITING status since another job is running on the 
standalone cluster. This may sound like a very stupid question but my 
expectation would be that I can submit multiple jobs at the same time 
and there would be some kind of a fair strategy to run them in turn. 
What Spark (basics) have I slept through? :)


Thanks!
Ognen

On 3/24/14, 4:00 PM, Ognen Duzlevski wrote:
Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt") 
supposed to work? Meaning, can I save files to the HDFS fs this way?


I tried:

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.saveAsTextFile("hdfs://ip:port/path/file.txt")

and it is just hanging. At the same time on my HDFS it created 
file.txt but as a directory which has subdirectories (the final one is 
empty).


Thanks!
Ognen

--
“No matter what they ever do to us, we must always act for the love of our 
people and the earth. We must not react out of hatred against those who have no 
sense.”
-- John Trudell


Re: question about partitions

2014-03-24 Thread Walrus theCat
For instance, I need to work with an RDD in terms of N parts.  Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?


On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote:

> Hi,
>
> Quick question about partitions.  If my RDD is partitioned into 5
> partitions, does that mean that I am constraining it to exist on at most 5
> machines?
>
> Thanks
>


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks for your help, everyone.  Several folks have explained that I can
surely solve the problem by installing sbt.

But I'm trying to get the instructions working *as written on the Spark
website*.  The instructions not only don't have you install sbt
separately...they actually specifically have you use the sbt that is
distributed with Spark.

If it is not possible to build your own Spark programs with
Spark-distributed sbt, then that's a big hole in the Spark docs that I
shall file.  And if the sbt that is included with Spark is MEANT to be able
to compile your own Spark apps, then that's a product bug.

But before I file the bug, I'm still hoping I'm missing something, and
someone will point out that I'm missing a small step that will make the
Spark distribution of sbt work!

Diana



On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska wrote:

> Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
> (since like other folks I had sbt preinstalled on my "usual" machine)
>
> I ran the command exactly as Ognen suggested and see
> Set current project to Simple Project (do you see this -- you should
> at least be seeing this)
> and then a bunch of Resolving ...
>
> messages. I did get an error there, saying it can't find
> javax.servlet.orbit. I googled the error and found this thread:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
>
> adding the IvyXML fragment they suggested helped in my case (but
> again, the build pretty clearly complained).
>
> If you're still having no luck, I suggest installing sbt and setting
> SBT_HOME... http://www.scala-sbt.org/
>
> In either case though, it's not a Spark-specific issue...Hopefully
> some of all this helps.
>
> On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll 
> wrote:
> > Yeah, that's exactly what I did. Unfortunately it doesn't work:
> >
> > $SPARK_HOME/sbt/sbt package
> > awk: cmd. line:1: fatal: cannot open file `./project/build.properties'
> for
> > reading (No such file or directory)
> > Attempting to fetch sbt
> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > directory
> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > directory
> > Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> > install sbt manually from http://www.scala-sbt.org/
> >
> >
> >
> > On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
> >  wrote:
> >>
> >> You can use any sbt on your machine, including the one that comes with
> >> spark. For example, try:
> >>
> >> ~/path_to_spark/sbt/sbt compile
> >> ~/path_to_spark/sbt/sbt run 
> >>
> >> Or you can just add that to your PATH by:
> >>
> >> export $PATH=$PATH:~/path_to_spark/sbt
> >>
> >> To make it permanent, you can add it to your ~/.bashrc or
> ~/.bash_profile
> >> or ??? depending on the system you are using. If you are on Windows,
> sorry,
> >> I can't offer any help there ;)
> >>
> >> Ognen
> >>
> >>
> >> On 3/24/14, 3:16 PM, Diana Carroll wrote:
> >>
> >> Thanks Ongen.
> >>
> >> Unfortunately I'm not able to follow your instructions either.  In
> >> particular:
> >>>
> >>>
> >>> sbt compile
> >>> sbt run 
> >>
> >>
> >> This doesn't work for me because there's no program on my path called
> >> "sbt".  The instructions in the Quick Start guide are specific that I
> should
> >> call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my
> system
> >> called "sbt".
> >>
> >> Did you download and install sbt separately?  In following the Quick
> Start
> >> guide, that was not stated as a requirement, and I'm trying to run
> through
> >> the guide word for word.
> >>
> >> Diana
> >>
> >>
> >> On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski
> >>  wrote:
> >>>
> >>> Diana,
> >>>
> >>> Anywhere on the filesystem you have read/write access (you need not be
> in
> >>> your spark home directory):
> >>>
> >>> mkdir myproject
> >>> cd myproject
> >>> mkdir project
> >>> mkdir target
> >>> mkdir -p src/main/scala
> >>> cp $mypath/$mymysource.scala src/main/scala/
> >>> cp $mypath/myproject.sbt .
> >>>
> >>> Make sure that myproject.sbt has the following in it:
> >>>
> >>> name := "I NEED A NAME!"
> >>>
> >>> version := "I NEED A VERSION!"
> >>>
> >>> scalaVersion := "2.10.3"
> >>>
> >>> libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
> >>> "0.9.0-incubating"
> >>>
> >>> If you will be using Hadoop/HDFS functionality you will need the below
> >>> line also
> >>>
> >>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
> >>>
> >>> The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are
> >>> using 0.8.1 - adjust appropriately.
> >>>
> >>> That's it. Now you can do
> >>>
> >>> sbt compile
> >>> sbt run 
> >>>
> >>> You can also do
> >>> sbt package to produce a jar file of your code which you can then add
> to
> >>> the SparkContext at runtime.
> >>>
> >>> In a more complicated project you may need to have a bit mo

Re: question about partitions

2014-03-24 Thread Syed A. Hashmi
RDD.coalesce should be fine for rebalancing data across all RDD partitions.
Coalesce is pretty handy in situations where you have sparse data and want
to compact it (e.g. data after applying a strict filter), OR you know the
magic number of partitions that will be optimal for your cluster.

One point to watch out for, though: if N is greater than your current number
of partitions, you need to pass shuffle=true to coalesce. If N is less than
your current number of partitions (i.e. you are shrinking partitions), do not
set shuffle=true, otherwise it will cause additional unnecessary shuffle
overhead.
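
For example, a quick sketch in the Scala shell (the numbers are arbitrary):

val rdd = sc.parallelize(1 to 1000, 100)      // 100 partitions

// Shrinking: no shuffle needed, partitions are merged locally
val fewer = rdd.coalesce(10)
fewer.partitions.size                          // 10

// Growing: pass shuffle = true, otherwise the partition count stays at 100
val more = rdd.coalesce(200, shuffle = true)
more.partitions.size                           // 200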


On Mon, Mar 24, 2014 at 2:32 PM, Walrus theCat wrote:

> For instance, I need to work with an RDD in terms of N parts.  Will
> calling RDD.coalesce(N) possibly cause processing bottlenecks?
>
>
> On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote:
>
>> Hi,
>>
>> Quick question about partitions.  If my RDD is partitioned into 5
>> partitions, does that mean that I am constraining it to exist on at most 5
>> machines?
>>
>> Thanks
>>
>
>


Re: spark executor/driver log files management

2014-03-24 Thread Tathagata Das
You can use RollingFileAppenders in log4j.properties.
http://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html

You can have other scripts delete old logs.
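
For example, a rough log4j.properties sketch using the size-based
RollingFileAppender from core log4j (the time-based appender linked above is
configured along the same lines; the path, size and backup count below are
just placeholders):

log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/executor.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=10
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n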

TD


On Mon, Mar 24, 2014 at 12:20 AM, Sourav Chandra <
sourav.chan...@livestream.com> wrote:

> Hi,
>
> I have few questions regarding log file management in spark:
>
> 1. Currently I did not find any way to modify the lof file name for
> executor/drivers). Its hardcoded as stdout and stderr. Also there is no log
> rotation.
>
> In case of streaming application this will grow forever and become
> unmanageable. Is there any way to overcome this?
>
> Thanks,
> --
>
> Sourav Chandra
>
> Senior Software Engineer
>
> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>
> sourav.chan...@livestream.com
>
> o: +91 80 4121 8723
>
> m: +91 988 699 3746
>
> skype: sourav.chandra
>
> Livestream
>
> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
> Block, Koramangala Industrial Area,
>
> Bangalore 560034
>
> www.livestream.com
>


Re: Writing RDDs to HDFS

2014-03-24 Thread Diana Carroll
Ongen:

I don't know why your process is hanging, sorry.  But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file.  The "file" is saved in multiple parts, corresponding to the
partitions. (part-0, part-1 etc.)

(Presumably it does this because it allows each partition to be saved on
the local disk, to minimize network traffic.  It's how Hadoop works, too.)
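
For example (just a sketch -- the namenode address and path are placeholders):

val r = sc.parallelize(1 to 8, 4)
r.saveAsTextFile("hdfs://namenode:8020/user/me/out")
// HDFS now contains a directory:
//   /user/me/out/part-00000 ... part-00003   (one file per partition)
// Reading the directory back picks up all the parts:
sc.textFile("hdfs://namenode:8020/user/me/out").count()   // 8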




On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski
wrote:

> Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
> supposed to work? Meaning, can I save files to the HDFS fs this way?
>
> I tried:
>
> val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
> r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>
> and it is just hanging. At the same time on my HDFS it created file.txt
> but as a directory which has subdirectories (the final one is empty).
>
> Thanks!
> Ognen
>


Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Diana, thanks. I am not very well acquainted with HDFS. I use hdfs -put 
to put things as files into the filesystem (and sc.textFile to get stuff 
out of them in Spark) and I see that they appear to be saved as files 
that are replicated across 3 out of the 16 nodes in the hdfs cluster 
(which in my case is also my Spark cluster) -> hence, I was puzzled why 
a directory this time. What you are saying makes sense, I suppose. As 
for the hanging - I am curious about that myself.


Ognen

On 3/24/14, 5:01 PM, Diana Carroll wrote:

Ongen:

I don't know why your process is hanging, sorry.  But I do know that 
the way saveAsTextFile works is that you give it a path to a 
directory, not a file.  The "file" is saved in multiple parts, 
corresponding to the partitions. (part-0, part-1 etc.)


(Presumably it does this because it allows each partition to be saved 
on the local disk, to minimize network traffic.  It's how Hadoop 
works, too.)





On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski 
mailto:og...@nengoiksvelzud.com>> wrote:


Is
someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
supposed to work? Meaning, can I save files to the HDFS fs this way?

I tried:

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.saveAsTextFile("hdfs://ip:port/path/file.txt")

and it is just hanging. At the same time on my HDFS it created
file.txt but as a directory which has subdirectories (the final
one is empty).

Thanks!
Ognen




Re: error loading large files in PySpark 0.9.0

2014-03-24 Thread Jeremy Freeman
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10, 
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls 
at the same point in each case.
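
For reference, this is how I am passing it (assuming the 0.9 PySpark API; the
master URL and path are placeholders):

from pyspark import SparkContext

# batchSize controls how many Python objects get pickled together per batch
sc = SparkContext("spark://master:7077", "load-test", batchSize=10)

data = sc.textFile("path/to/myfile")
data.count()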

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Mar 23, 2014, at 9:56 PM, Matei Zaharia  wrote:

> Hey Jeremy, what happens if you pass batchSize=10 as an argument to your 
> SparkContext? It tries to serialize that many objects together at a time, 
> which might be too much. By default the batchSize is 1024.
> 
> Matei
> 
> On Mar 23, 2014, at 10:11 AM, Jeremy Freeman  wrote:
> 
>> Hi all,
>> 
>> Hitting a mysterious error loading large text files, specific to PySpark
>> 0.9.0.
>> 
>> In PySpark 0.8.1, this works:
>> 
>> data = sc.textFile("path/to/myfile")
>> data.count()
>> 
>> But in 0.9.0, it stalls. There are indications of completion up to:
>> 
>> 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
>> (progress: 15/537)
>> 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
>> 
>> And then this repeats indefinitely
>> 
>> 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
>> runningTasks: 144
>> 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
>> runningTasks: 144
>> 
>> Always stalls at the same place. There's nothing in stderr on the workers,
>> but in stdout there are several of these messages:
>> 
>> INFO PythonRDD: stdin writer to Python finished early
>> 
>> So perhaps the real error is being suppressed as in
>> https://spark-project.atlassian.net/browse/SPARK-1025
>> 
>> Data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
>> characters per row. Running on a private cluster with 10 nodes, 100GB / 16
>> cores each, Python v 2.7.6.
>> 
>> I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
>> and in PySpark in 0.8.1. Happy to post the file, but it should repro for
>> anything with these dimensions. It *might* be specific to long strings: I
>> don't see it with fewer characters (10k) per row, but I also don't see it
>> with many fewer rows but the same number of characters per row.
>> 
>> Happy to try and provide more info / help debug!
>> 
>> -- Jeremy
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 



Re: Sliding Window operations do not work as documented

2014-03-24 Thread Tathagata Das
Hello Sanjay,

Yes, your understanding of lazy semantics is correct. But ideally the input
should be read in every batch, based on the batch interval provided in the
StreamingContext. Can you open a JIRA on this?

On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani
 wrote:
> Hi All,
>
> I found out why this problem exists. Consider the following scenario:
> - a DStream is created from any source. (I've checked with file and socket)
> - No actions are applied to this DStream
> - Sliding Window operation is applied to this DStream and an action is
> applied to the sliding window.
> In this case, Spark will not even read the input stream in the batch in
> which the sliding interval isn't a multiple of batch interval. Put another
> way, it won't read the input when it doesn't have to apply the window
> function. This is happening because all transformations in Spark are lazy.
>
> How to fix this or workaround it (see line#3):
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new
> Duration(1 * 60 * 1000));
> JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
> inputStream.print(); // This is the workaround
> JavaDStream<String> objWindow = inputStream.window(new
> Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
> objWindow.dstream().saveAsTextFiles("/Output", "");
>
>
> The "Window operations" example on the streaming guide implies that Spark
> will read the stream in every batch, which is not happening because of the
> lazy transformations.
> Wherever sliding window would be used, in most of the cases, no actions will
> be taken on the pre-window batch, hence my gut feeling was that Streaming
> would read every batch if any actions are being taken in the windowed
> stream.
>
> Regards,
> Sanjay
>
>
> On Friday, 21 March 2014 8:06 PM, Sanjay Awatramani 
> wrote:
> Hi,
>
> I want to run a map/reduce process over last 5 seconds of data, every 4
> seconds. This is quite similar to the sliding window pictorial example under
> Window Operations section on
> http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
> .
>
> The RDDs returned by window transformation function are incorrect in my
> case. To investigate this further, I ran a series of examples with varying
> values of window length & slide interval. Summary of the test results:
> (window length, slide interval) -> result
> (3,1) -> success
> (4,2) -> success
> (3,2) -> fail
> (4,3) -> fail
> (5,4) -> fail
> (5,2) -> fail
>
> The only condition mentioned in the doc is that the two values(5 & 4) should
> be multiples of batch interval(1 in my case) and obviously, I get a run time
> error if I attempt to violate this condition. Looking at my results, it
> seems that failures result when the slide interval isn't a multiple of
> window length.
>
> My code:
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new
> Duration(1 * 60 * 1000));
> JavaDStream inputStream = stcObj.textFileStream("/Input");
> JavaDStream objWindow = inputStream.window(new
> Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
> objWindow.dstream().saveAsTextFiles("/Output", "");
>
> Detailed results:
> (3,1) -> success
> @t_0: [inputStream's RDD@t_0]
> @t_1: [inputStream's RDD@t_0,1]
> @t_2: [inputStream's RDD@t_0,1,2]
> @t_3: [inputStream's RDD@t_1,2,3]
> @t_4: [inputStream's RDD@t_2,3,4]
> @t_5: [inputStream's RDD@t_3,4,5]
>
> (4,2) -> success
> @t_0: nothing
> @t_1: [inputStream's RDD@t_0,1]
> @t_2: nothing
> @t_3: [inputStream's RDD@t_0,1,2,3]
> @t_4: nothing
> @t_5: [inputStream's RDD@t_2,3,4,5]
>
> (3,2) -> fail
> @t_0: nothing
> @t_1: [inputStream's RDD@t_0,1]
> @t_2: nothing
> @t_3: [inputStream's RDD@t_2,3]//(expected RDD@t_1,2,3)
> @t_4: nothing
> @t_5: [inputStream's RDD@t_4,5]//(expected RDD@t_3,4,5)
>
> (4,3) -> fail
> @t_0: nothing
> @t_1: nothing
> @t_2: [inputStream's RDD@t_0,1,2]
> @t_3: nothing
> @t_4: nothing
> @t_5: [inputStream's RDD@t_3,4,5]//(expected RDD@t_2,3,4,5)
>
> (5,4) -> fail
> @t_0: nothing
> @t_1: nothing
> @t_2: nothing
> @t_3: [inputStream's RDD@t_0,1,2,3]
> @t_4: nothing
> @t_5: nothing
> @t_6: nothing
> @t_7: [inputStream's RDD@t_4,5,6,7]//(expected RDD@t_3,4,5,6,7)
>
> (5,2) -> fail
> @t_0: nothing
> @t_1: [inputStream's RDD@t_0,1]
> @t_2: nothing
> @t_3: [inputStream's RDD@t_0,1,2,3]
> @t_4: nothing
> @t_5: [inputStream's RDD@t_2,3,4,5]//(expected RDD@t_1,2,3,4,5)
> @t_6: nothing
> @t_7: [inputStream's RDD@t_4,5,6,7]//(expected RDD@t_3,4,5,6,7)
>
> I have run all the above examples twice to be sure !
> I believe either my understanding of sliding window mechanism is incorrect
> or there is a problem in the sliding window mechanism.
>
> Regards,
> Sanjay
>
>


Re: [bug?] streaming window unexpected behaviour

2014-03-24 Thread Tathagata Das
Yes, I believe that is current behavior. Essentially, the first few
RDDs will be partial windows (assuming window duration > sliding
interval).

TD


On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu
 wrote:
> I have what I would call unexpected behaviour when using window on a stream.
>
> I have 2 windowed streams with a 5s batch interval. One window stream is
> (5s,5s)=smallWindow and the other (10s,5s)=bigWindow
>
> What I've noticed is that the 1st RDD produced by bigWindow is incorrect and
> is of the size 5s not 10s. So instead of waiting 10s and producing 1 RDD
> with size 10s, Spark produced the 1st 10s RDD of size 5s.
>
> Why is this happening? To me it looks like a bug; Matei or TD can you verify
> that this is correct behaviour?
>
>
>
>
>
> I have the following code
>
> val ssc = new StreamingContext(conf, Seconds(5))
>
>
>
> val smallWindowStream = ssc.queueStream(smallWindowRddQueue)
>
> val bigWindowStream = ssc.queueStream(bigWindowRddQueue)
>
>
>
> val smallWindow = smallWindowReshapedStream.window(Seconds(5), Seconds(5))
>
>   .reduceByKey((t1, t2) => (t1._1, t1._2, t1._3 + t2._3))
>
> val bigWindow = bigWindowReshapedStream.window(Seconds(10), Seconds(5))
>
> .reduceByKey((t1, t2) => (t1._1, t1._2, t1._3 + t2._3))
>
>
>
> -Adrian
>
>


Re: question about partitions

2014-03-24 Thread Walrus theCat
Syed,

Thanks for the tip.  I'm not sure if coalesce is doing what I'm intending
to do, which is, in effect, to subdivide the RDD into N parts (by calling
coalesce and doing operations on the partitions.)  It sounds like, however,
this won't bottleneck my processing power.  If this sets off any alarms for
anyone, feel free to chime in.
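
To be concrete, I was picturing something roughly like this (just a sketch;
rdd, N and the per-part work are placeholders):

// force exactly N parts, then run the per-part computation independently
val parts = rdd.coalesce(N, shuffle = true)
val perPart = parts.mapPartitionsWithIndex { (i, iter) =>
  // the real per-part computation goes here; this just emits the size of each part
  Iterator((i, iter.size))
}
perPart.collect()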


On Mon, Mar 24, 2014 at 2:50 PM, Syed A. Hashmi wrote:

> RDD.coalesce should be fine for rebalancing data across all RDD
> partitions. Coalesce is pretty handy in situations where you have sparse
> data and want to compact it (e.g. data after applying a strict filter) OR
> you know the magic number of partitions according to your cluster which
> will be optimal.
>
> One point to watch out though is that if N is greater than your current
> partitions, you need to pass shuffle=true to coalesce. If N is less than
> your current partitions (i.e. you are shrinking partitions, do not set
> shuffle=true, otherwise it will cause additional unnecessary shuffle
> overhead.
>
>
> On Mon, Mar 24, 2014 at 2:32 PM, Walrus theCat wrote:
>
>> For instance, I need to work with an RDD in terms of N parts.  Will
>> calling RDD.coalesce(N) possibly cause processing bottlenecks?
>>
>>
>> On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote:
>>
>>> Hi,
>>>
>>> Quick question about partitions.  If my RDD is partitioned into 5
>>> partitions, does that mean that I am constraining it to exist on at most 5
>>> machines?
>>>
>>> Thanks
>>>
>>
>>
>


Re: Comparing GraphX and GraphLab

2014-03-24 Thread Niko Stahl
Hi Ankur, hi Deb,

Thanks for the information and for the reference to the recent paper. I
understand that GraphLab is highly optimized for graph algorithms and
consistently outperforms GraphX for graph related tasks. I'd like to
further evaluate the cost of moving data between Spark and some other graph
processing framework (e.g. GraphLab). The paper touches on this briefly,
citing serialization, replication, and disk I/O as the main factors.

Do you have any suggestions on how to further investigate the impact of
these factors? For example, I suppose the impact of replication depends on
cluster size and HDFS configuration. Your help is greatly appreciated.

Best,
Niko


On Mon, Mar 24, 2014 at 8:35 PM, Debasish Das wrote:

> Hi Ankur,
>
> Given enough memory and proper caching, I don't understand why is this the
> case?
>
> GraphX may actually be slower when Spark is configured to launch many
> tasks per machine, because shuffle communication between Spark tasks on the
> same machine still occurs by reading and writing from disk, while GraphLab
> uses shared memory for same-machine communication
>
> Could you please elaborate more on it ?
>
>  Thanks.
> Deb
>
>
>
> On Mon, Mar 24, 2014 at 1:01 PM, Ankur Dave  wrote:
>
>> Hi Niko,
>>
>> The GraphX team recently wrote a longer paper with more benchmarks and
>> optimizations: http://arxiv.org/abs/1402.2394
>>
>> Regarding the performance of GraphX vs. GraphLab, I believe GraphX
>> currently outperforms GraphLab only in end-to-end benchmarks of pipelines
>> involving both graph-parallel operations (e.g. PageRank) and data-parallel
>> operations (e.g. ETL and data cleaning). This is due to the overhead of
>> moving data between GraphLab and a data-parallel system like Spark. There's
>> an example of a pipeline in Section 5.2 in the linked paper, and the
>> results are in Figure 10 on page 11.
>>
>> GraphX has a very similar architecture as GraphLab, so I wouldn't expect
>> it to have better performance on pure graph algorithms. GraphX may actually
>> be slower when Spark is configured to launch many tasks per machine,
>> because shuffle communication between Spark tasks on the same machine still
>> occurs by reading and writing from disk, while GraphLab uses shared memory
>> for same-machine communication.
>>
>> I've CC'd Joey and Reynold as well.
>>
>> Ankur 
>>
>> On Mar 24, 2014 11:00 AM, "Niko Stahl"  wrote:
>>
>>> I'm interested in extending the comparison between GraphX and GraphLab
>>> presented in Xin et. al (2013). The evaluation presented there is rather
>>> limited as it only compares the frameworks for one algorithm (PageRank) on
>>> a cluster with a fixed number of nodes. Are there any graph algorithms
>>> where one might expect GraphX to perform better than GraphLab? Do you
>>> expect the scaling properties (i.e. performance as a function of # of
>>> worker nodes) to differ?
>>>
>>
>


Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Sweet! That's simple enough.

Here's a JIRA ticket to track adding this to PySpark for the future:

https://spark-project.atlassian.net/browse/SPARK-1308

Nick


On Mon, Mar 24, 2014 at 4:29 PM, Patrick Wendell  wrote:

> Ah we should just add this directly in pyspark - it's as simple as the
> code Shivaram just wrote.
>
> - Patrick
>
> On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
>  wrote:
> > There is no direct way to get this in pyspark, but you can get it from
> the
> > underlying java rdd. For example
> >
> > a = sc.parallelize([1,2,3,4], 2)
> > a._jrdd.splits().size()
> >
> >
> > On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
> >  wrote:
> >>
> >> Mark,
> >>
> >> This appears to be a Scala-only feature. :(
> >>
> >> Patrick,
> >>
> >> Are we planning to add this to PySpark?
> >>
> >> Nick
> >>
> >>
> >> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra  >
> >> wrote:
> >>>
> >>> It's much simpler: rdd.partitions.size
> >>>
> >>>
> >>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
> >>>  wrote:
> 
>  Hey there fellow Dukes of Data,
> 
>  How can I tell how many partitions my RDD is split into?
> 
>  I'm interested in knowing because, from what I gather, having a good
>  number of partitions is good for performance. If I'm looking to
> understand
>  how my pipeline is performing, say for a parallelized write out to
> HDFS,
>  knowing how many partitions an RDD has would be a good thing to check.
> 
>  Is that correct?
> 
>  I could not find an obvious method or property to see how my RDD is
>  partitioned. Instead, I devised the following thingy:
> 
>  def f(idx, itr): yield idx
> 
>  rdd = sc.parallelize([1, 2, 3, 4], 4)
>  rdd.mapPartitionsWithIndex(f).count()
> 
>  Frankly, I'm not sure what I'm doing here, but this seems to give me
> the
>  answer I'm looking for. Derp. :)
> 
>  So in summary, should I care about how finely my RDDs are partitioned?
>  And how would I check on that?
> 
>  Nick
> 
> 
>  
>  View this message in context: How many partitions is my RDD split
> into?
>  Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>>
> >>
> >
>


Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Just so I can close this thread (in case anyone else runs into this 
stuff) - I did sleep through the basics of Spark ;). The answer on why 
my job is in waiting state (hanging) is here: 
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
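
In short: in standalone mode an application grabs all available cores by
default, so a second application sits in WAITING until the first one finishes,
unless you cap the first one -- e.g. something along these lines before
creating the SparkContext (the master URL and core count are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("my-app")
  .set("spark.cores.max", "8")   // leave cores free for other applications
val sc = new SparkContext(conf)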


Ognen

On 3/24/14, 5:01 PM, Diana Carroll wrote:

Ongen:

I don't know why your process is hanging, sorry.  But I do know that 
the way saveAsTextFile works is that you give it a path to a 
directory, not a file.  The "file" is saved in multiple parts, 
corresponding to the partitions. (part-0, part-1 etc.)


(Presumably it does this because it allows each partition to be saved 
on the local disk, to minimize network traffic.  It's how Hadoop 
works, too.)





On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski 
mailto:og...@nengoiksvelzud.com>> wrote:


Is
someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
supposed to work? Meaning, can I save files to the HDFS fs this way?

I tried:

val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
r.saveAsTextFile("hdfs://ip:port/path/file.txt")

and it is just hanging. At the same time on my HDFS it created
file.txt but as a directory which has subdirectories (the final
one is empty).

Thanks!
Ognen




--
"A distributed system is one in which the failure of a computer you didn't even know 
existed can render your own computer unusable"
-- Leslie Lamport



Re: N-Fold validation and RDD partitions

2014-03-24 Thread Walrus theCat
If someone wanted / needed to implement this themselves, are partitions the
correct way to go?  Any tips on how to get started (say, dividing an RDD
into 5 parts)?
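
For concreteness, something along these lines is what I am imagining (just a
sketch against the 0.9 Scala API; data, k and the seed are placeholders):

val k = 5
val seed = 42L
// tag every element with a pseudo-random fold id; with a fixed seed per partition
// the assignment is deterministic, so recomputation yields the same folds
val tagged = data.mapPartitionsWithIndex { (pid, iter) =>
  val rng = new scala.util.Random(seed + pid)
  iter.map(x => (rng.nextInt(k), x))
}
val folds = (0 until k).map { i =>
  val test  = tagged.filter(_._1 == i).map(_._2)
  val train = tagged.filter(_._1 != i).map(_._2)
  (train, test)
}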



On Fri, Mar 21, 2014 at 9:51 AM, Jaonary Rabarisoa wrote:

> Thank you Hai-Anh. Are the files   CrossValidation.scala and 
> RandomSplitRDD.scala
>  enough to use it ? I'm currently using spark 0.9.0 and I to avoid to
> rebuild every thing.
>
>
>
>
> On Fri, Mar 21, 2014 at 4:58 PM, Hai-Anh Trinh  wrote:
>
>> Hi Jaonary,
>>
>> You can find the code for k-fold CV in
>> https://github.com/apache/incubator-spark/pull/448. I have not find the
>> time to resubmit the pull to latest master.
>>
>>
>> On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani > > wrote:
>>
>>> Hi Jaonary,
>>>
>>> I believe the n folds should be mapped into n Keys in spark using a map
>>> function. You can reduce the returned PairRDD and you should get your
>>> metric.
>>> I don't understand partitions fully, but from whatever I understand of
>>> it, they aren't required in your scenario.
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa 
>>> wrote:
>>>   Hi
>>>
>>> I need to partition my data represented as RDD into n folds and run
>>> metrics computation in each fold and finally compute the means of my
>>> metrics overall the folds.
>>> Does spark can do the data partition out of the box or do I need to
>>> implement it myself. I know that RDD has a partitions method and
>>> mapPartitions but I really don't understand the purpose and the meaning of
>>> partition here.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>>
>>>
>>
>>
>>  --
>> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
>> http://www.linkedin.com/in/haianh
>>
>>
>


Re: combining operations elegantly

2014-03-24 Thread Richard Siebeling
Hi guys,

thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard

@Patrick, thanks for the release calendar
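
For anyone who wants to stay dependency-free in the meantime, I believe the
same single-pass idea can also be written with a plain aggregate call -- a
rough sketch:

// one pass over the RDD: (count, sum, min, max) without Algebird
val rdd = sc.parallelize(1 to 10)
val (count, sum, min, max) = rdd.aggregate((0L, 0L, Int.MaxValue, Int.MinValue))(
  (acc, x) => (acc._1 + 1, acc._2 + x, math.min(acc._3, x), math.max(acc._4, x)),
  (a, b)   => (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))
)
// count = 10, sum = 55, min = 1, max = 10; average = sum.toDouble / count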


On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell wrote:

> Hey All,
>
> I think the old thread is here:
> https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
>
> The method proposed in that thread is to create a utility class for
> doing single-pass aggregations. Using Algebird is a pretty good way to
> do this and is a bit more flexible since you don't need to create a
> new utility each time you want to do this.
>
> In Spark 1.0 and later you will be able to do this more elegantly with
> the schema support:
> myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
> Average('duration) as 'duration)
>
> and it will use a single pass automatically... but that's not quite
> released yet :)
>
> - Patrick
>
>
>
>
> On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers  wrote:
> > i currently typically do something like this:
> >
> > scala> val rdd = sc.parallelize(1 to 10)
> > scala> import com.twitter.algebird.Operators._
> > scala> import com.twitter.algebird.{Max, Min}
> > scala> rdd.map{ x => (
> >  |   1L,
> >  |   Min(x),
> >  |   Max(x),
> >  |   x
> >  | )}.reduce(_ + _)
> > res0: (Long, com.twitter.algebird.Min[Int],
> com.twitter.algebird.Max[Int],
> > Int) = (10,Min(1),Max(10),55)
> >
> > however for this you need twitter algebird dependency. without that you
> have
> > to code the reduce function on the tuples yourself...
> >
> > another example with 2 columns, where i do conditional count for first
> > column, and simple sum for second:
> > scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
> >  |   if (x > 5) 1 else 0,
> >  |   y
> >  | )}.reduce(_ + _)
> > res3: (Int, Int) = (5,155)
> >
> >
> >
> > On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling  >
> > wrote:
> >>
> >> Hi Koert, Patrick,
> >>
> >> do you already have an elegant solution to combine multiple operations
> on
> >> a single RDD?
> >> Say for example that I want to do a sum over one column, a count and an
> >> average over another column,
> >>
> >> thanks in advance,
> >> Richard
> >>
> >>
> >> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <
> rsiebel...@gmail.com>
> >> wrote:
> >>>
> >>> Patrick, Koert,
> >>>
> >>> I'm also very interested in these examples, could you please post them
> if
> >>> you find them?
> >>> thanks in advance,
> >>> Richard
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers 
> wrote:
> 
>  not that long ago there was a nice example on here about how to
> combine
>  multiple operations on a single RDD. so basically if you want to do a
>  count() and something else, how to roll them into a single job. i
> think
>  patrick wendell gave the examples.
> 
>  i cant find them anymore patrick can you please repost? thanks!
> >>>
> >>>
> >>
> >
>


Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
Hi,

I have a large data set of numbers as an RDD and want to perform a computation
on groups of two values at a time. For example, given the RDD 1,2,3,4,5,6,7...,
can I group it into (1,2),(3,4),(5,6)... and perform the respective
computations efficiently? Since we don't have a way to index elements directly
(e.g. a for loop over (i, i+1)), is there a way to solve this? Please suggest
an approach; I would be really thankful to you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
Diana, I think you are correct - I just installed
 wget 
http://mirror.symnds.com/software/Apache/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-cdh4.tgz
and indeed I see the same error that you see

It looks like in previous versions sbt-launch used to just come down
in the package, but now they try to get it for you -- and that code
seems to have some assumptions on where it is being invoked from

On Mon, Mar 24, 2014 at 5:47 PM, Diana Carroll  wrote:
> Thanks for your help, everyone.  Several folks have explained that I can
> surely solve the problem by installing sbt.
>
> But I'm trying to get the instructions working as written on the Spark
> website.  The instructions not only don't have you install sbt
> separately...they actually specifically have you use the sbt that is
> distributed with Spark.
>
> If it is not possible to build your own Spark programs with
> Spark-distributed sbt, then that's a big hole in the Spark docs that I shall
> file.  And if the sbt that is included with Spark is MEANT to be able to
> compile your own Spark apps, then that's a product bug.
>
> But before I file the bug, I'm still hoping I'm missing something, and
> someone will point out that I'm missing a small step that will make the
> Spark distribution of sbt work!
>
> Diana
>
>
>
> On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska 
> wrote:
>>
>> Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
>> (since like other folks I had sbt preinstalled on my "usual" machine)
>>
>> I ran the command exactly as Ognen suggested and see
>> Set current project to Simple Project (do you see this -- you should
>> at least be seeing this)
>> and then a bunch of Resolving ...
>>
>> messages. I did get an error there, saying it can't find
>> javax.servlet.orbit. I googled the error and found this thread:
>>
>>
>> http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
>>
>> adding the IvyXML fragment they suggested helped in my case (but
>> again, the build pretty clearly complained).
>>
>> If you're still having no luck, I suggest installing sbt and setting
>> SBT_HOME... http://www.scala-sbt.org/
>>
>> In either case though, it's not a Spark-specific issue...Hopefully
>> some of all this helps.
>>
>> On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll 
>> wrote:
>> > Yeah, that's exactly what I did. Unfortunately it doesn't work:
>> >
>> > $SPARK_HOME/sbt/sbt package
>> > awk: cmd. line:1: fatal: cannot open file `./project/build.properties'
>> > for
>> > reading (No such file or directory)
>> > Attempting to fetch sbt
>> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
>> > directory
>> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
>> > directory
>> > Our attempt to download sbt locally to sbt/sbt-launch-.jar failed.
>> > Please
>> > install sbt manually from http://www.scala-sbt.org/
>> >
>> >
>> >
>> > On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
>> >  wrote:
>> >>
>> >> You can use any sbt on your machine, including the one that comes with
>> >> spark. For example, try:
>> >>
>> >> ~/path_to_spark/sbt/sbt compile
>> >> ~/path_to_spark/sbt/sbt run 
>> >>
>> >> Or you can just add that to your PATH by:
>> >>
>> >> export $PATH=$PATH:~/path_to_spark/sbt
>> >>
>> >> To make it permanent, you can add it to your ~/.bashrc or
>> >> ~/.bash_profile
>> >> or ??? depending on the system you are using. If you are on Windows,
>> >> sorry,
>> >> I can't offer any help there ;)
>> >>
>> >> Ognen
>> >>
>> >>
>> >> On 3/24/14, 3:16 PM, Diana Carroll wrote:
>> >>
>> >> Thanks Ongen.
>> >>
>> >> Unfortunately I'm not able to follow your instructions either.  In
>> >> particular:
>> >>>
>> >>>
>> >>> sbt compile
>> >>> sbt run 
>> >>
>> >>
>> >> This doesn't work for me because there's no program on my path called
>> >> "sbt".  The instructions in the Quick Start guide are specific that I
>> >> should
>> >> call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my
>> >> system
>> >> called "sbt".
>> >>
>> >> Did you download and install sbt separately?  In following the Quick
>> >> Start
>> >> guide, that was not stated as a requirement, and I'm trying to run
>> >> through
>> >> the guide word for word.
>> >>
>> >> Diana
>> >>
>> >>
>> >> On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski
>> >>  wrote:
>> >>>
>> >>> Diana,
>> >>>
>> >>> Anywhere on the filesystem you have read/write access (you need not be
>> >>> in
>> >>> your spark home directory):
>> >>>
>> >>> mkdir myproject
>> >>> cd myproject
>> >>> mkdir project
>> >>> mkdir target
>> >>> mkdir -p src/main/scala
>> >>> cp $mypath/$mymysource.scala src/main/scala/
>> >>> cp $mypath/myproject.sbt .
>> >>>
>> >>> Make sure that myproject.sbt has the following in it:
>> >>>
>> >>> name := "I NEED A NAME!"
>> >>>
>> >>> version := "I NEED A VERSION!"
>> >>>
>> >>> scalaVersion := "2.10.3"
>> >>>
>> >>> libraryDependencies += "o

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
I realize I never read the document carefully, and I can't find anywhere that the 
Spark documentation suggests using the Spark-distributed sbt……  

Best,

--  
Nan Zhu



On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:

> Thanks for your help, everyone.  Several folks have explained that I can 
> surely solve the problem by installing sbt.
>  
> But I'm trying to get the instructions working as written on the Spark 
> website.  The instructions not only don't have you install sbt 
> separately...they actually specifically have you use the sbt that is 
> distributed with Spark.  
>  
> If it is not possible to build your own Spark programs with Spark-distributed 
> sbt, then that's a big hole in the Spark docs that I shall file.  And if the 
> sbt that is included with Spark is MEANT to be able to compile your own Spark 
> apps, then that's a product bug.  
>  
> But before I file the bug, I'm still hoping I'm missing something, and 
> someone will point out that I'm missing a small step that will make the Spark 
> distribution of sbt work!
>  
> Diana
>  
>  
>  
> On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska  (mailto:yana.kadiy...@gmail.com)> wrote:
> > Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
> > (since like other folks I had sbt preinstalled on my "usual" machine)
> >  
> > I ran the command exactly as Ognen suggested and see
> > Set current project to Simple Project (do you see this -- you should
> > at least be seeing this)
> > and then a bunch of Resolving ...
> >  
> > messages. I did get an error there, saying it can't find
> > javax.servlet.orbit. I googled the error and found this thread:
> >  
> > http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
> >  
> > adding the IvyXML fragment they suggested helped in my case (but
> > again, the build pretty clearly complained).
> >  
> > If you're still having no luck, I suggest installing sbt and setting
> > SBT_HOME... http://www.scala-sbt.org/
> >  
> > In either case though, it's not a Spark-specific issue...Hopefully
> > some of all this helps.
> >  
> > On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll  > (mailto:dcarr...@cloudera.com)> wrote:
> > > Yeah, that's exactly what I did. Unfortunately it doesn't work:
> > >
> > > $SPARK_HOME/sbt/sbt package
> > > awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
> > > reading (No such file or directory)
> > > Attempting to fetch sbt
> > > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > > directory
> > > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > > directory
> > > Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> > > install sbt manually from http://www.scala-sbt.org/
> > >
> > >
> > >
> > > On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
> > > mailto:og...@plainvanillagames.com)> wrote:
> > >>
> > >> You can use any sbt on your machine, including the one that comes with
> > >> spark. For example, try:
> > >>
> > >> ~/path_to_spark/sbt/sbt compile
> > >> ~/path_to_spark/sbt/sbt run 
> > >>
> > >> Or you can just add that to your PATH by:
> > >>
> > >> export $PATH=$PATH:~/path_to_spark/sbt
> > >>
> > >> To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile
> > >> or ??? depending on the system you are using. If you are on Windows, 
> > >> sorry,
> > >> I can't offer any help there ;)
> > >>
> > >> Ognen
> > >>
> > >>
> > >> On 3/24/14, 3:16 PM, Diana Carroll wrote:
> > >>
> > >> Thanks Ongen.
> > >>
> > >> Unfortunately I'm not able to follow your instructions either.  In
> > >> particular:
> > >>>
> > >>>
> > >>> sbt compile
> > >>> sbt run 
> > >>
> > >>
> > >> This doesn't work for me because there's no program on my path called
> > >> "sbt".  The instructions in the Quick Start guide are specific that I 
> > >> should
> > >> call "$SPARK_HOME/sbt/sbt".  I don't have any other executable on my 
> > >> system
> > >> called "sbt".
> > >>
> > >> Did you download and install sbt separately?  In following the Quick 
> > >> Start
> > >> guide, that was not stated as a requirement, and I'm trying to run 
> > >> through
> > >> the guide word for word.
> > >>
> > >> Diana
> > >>
> > >>
> > >> On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski
> > >> mailto:og...@plainvanillagames.com)> wrote:
> > >>>
> > >>> Diana,
> > >>>
> > >>> Anywhere on the filesystem you have read/write access (you need not be 
> > >>> in
> > >>> your spark home directory):
> > >>>
> > >>> mkdir myproject
> > >>> cd myproject
> > >>> mkdir project
> > >>> mkdir target
> > >>> mkdir -p src/main/scala
> > >>> cp $mypath/$mymysource.scala src/main/scala/
> > >>> cp $mypath/myproject.sbt .
> > >>>
> > >>> Make sure that myproject.sbt has the following in it:
> > >>>
> > >>> name := "I NEED A NAME!"
> > >>>
> > >>> version := "I NEED A VERSION!"
> > >>>
> > >>> scalaVersion := "2.10.3"
> > >>>
> > >>> library

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Walrus theCat
I'm also interested in this.


On Mon, Mar 24, 2014 at 4:59 PM, yh18190  wrote:

> Hi, I have a large data set of numbers (an RDD) and want to perform a
> computation only on groups of two values at a time. For example, if
> 1,2,3,4,5,6,7... is an RDD, can I group the RDD into (1,2),(3,4),(5,6)...
> and perform the respective computations in an efficient manner? As we
> don't have a way to index elements directly using a for loop etc. (i, i+1), is
> there a way to resolve this problem? Please suggest something; I would be really
> thankful to you.
> --
> View this message in context: Splitting RDD and Grouping together to
> perform 
> computation
> Sent from the Apache Spark User List mailing list 
> archiveat Nabble.com.
>


Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
We need someone who can explain this with a short code snippet on the given example
so that we get a clear idea of RDD indexing.
Guys, please help us.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153p3158.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


coalescing RDD into equally sized partitions

2014-03-24 Thread Walrus theCat
Hi,

sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0
).coalesce(5,true).glom.collect  yields

Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())

How do I get something more like:

 Array(Array(0), Array(20), Array(40), Array(60), Array(80))

Thanks


Re: N-Fold validation and RDD partitions

2014-03-24 Thread Holden Karau
There is also https://github.com/apache/spark/pull/18 against the current
repo which may be easier to apply.


On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh  wrote:

> Hi Jaonary,
>
> You can find the code for k-fold CV in
> https://github.com/apache/incubator-spark/pull/448. I have not find the
> time to resubmit the pull to latest master.
>
>
> On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani 
> wrote:
>
>> Hi Jaonary,
>>
>> I believe the n folds should be mapped into n Keys in spark using a map
>> function. You can reduce the returned PairRDD and you should get your
>> metric.
>> I don't understand partitions fully, but from whatever I understand of
>> it, they aren't required in your scenario.
>>
>> Regards,
>> Sanjay
>>
>>
>>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa 
>> wrote:
>>   Hi
>>
>> I need to partition my data represented as RDD into n folds and run
>> metrics computation in each fold and finally compute the means of my
>> metrics overall the folds.
>> Does spark can do the data partition out of the box or do I need to
>> implement it myself. I know that RDD has a partitions method and
>> mapPartitions but I really don't understand the purpose and the meaning of
>> partition here.
>>
>>
>>
>> Cheers,
>>
>> Jaonary
>>
>>
>>
>
>
> --
> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
> http://www.linkedin.com/in/haianh
>
>


-- 
Cell : 425-233-8271


Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
It is suggested implicitly in giving you the command "./sbt/sbt". The
separately installed sbt isn't in a folder called sbt, whereas Spark's
version is.  And more relevantly, just a few paragraphs earlier in the
tutorial you execute the command "sbt/sbt assembly" which definitely refers
to the spark install.

On Monday, March 24, 2014, Nan Zhu  wrote:

> I found that I never read the document carefully and I never find that
> Spark document is suggesting you to use Spark-distributed sbt..
>
> Best,
>
> --
> Nan Zhu
>
>
> On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:
>
> Thanks for your help, everyone.  Several folks have explained that I can
> surely solve the problem by installing sbt.
>
> But I'm trying to get the instructions working *as written on the Spark
> website*.  The instructions not only don't have you install sbt
> separately...they actually specifically have you use the sbt that is
> distributed with Spark.
>
> If it is not possible to build your own Spark programs with
> Spark-distributed sbt, then that's a big hole in the Spark docs that I
> shall file.  And if the sbt that is included with Spark is MEANT to be able
> to compile your own Spark apps, then that's a product bug.
>
> But before I file the bug, I'm still hoping I'm missing something, and
> someone will point out that I'm missing a small step that will make the
> Spark distribution of sbt work!
>
> Diana
>
>
>
> On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska wrote:
>
> Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
> (since like other folks I had sbt preinstalled on my "usual" machine)
>
> I ran the command exactly as Ognen suggested and see
> Set current project to Simple Project (do you see this -- you should
> at least be seeing this)
> and then a bunch of Resolving ...
>
> messages. I did get an error there, saying it can't find
> javax.servlet.orbit. I googled the error and found this thread:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
>
> adding the IvyXML fragment they suggested helped in my case (but
> again, the build pretty clearly complained).
>
> If you're still having no luck, I suggest installing sbt and setting
> SBT_HOME... http://www.scala-sbt.org/
>
> In either case though, it's not a Spark-specific issue...Hopefully
> some of all this helps.
>
> On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll 
> wrote:
> > Yeah, that's exactly what I did. Unfortunately it doesn't work:
> >
> > $SPARK_HOME/sbt/sbt package
> > awk: cmd. line:1: fatal: cannot open file `./project/build.properties'
> for
> > reading (No such file or directory)
> > Attempting to fetch sbt
> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > directory
> > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > directory
> > Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
> > install sbt manually from http://www.scala-sbt.org/
> >
> >
> >
> > On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
> >  wrote:
> >>
> >> You can use any sbt on your machine, including the one that comes with
> >> spark. For example, try:
> >>
> >> ~/path_to_spark/sbt/sbt compile
> >> ~/path_to_spark/sbt/sbt run 
> >>
> >> Or you can just add that to your PATH by:
> >>
> >> export $PATH=$PATH:~/path_to_spark/sbt
> >>
> >> To make it permanent, you can add it to your ~/.bashrc or
> ~/.bash_profile
> >> or ??? depending on the system you are using. If you are on Windows,
> sorry,
> >> I can't offer any help there ;)
> >>
> >> Ognen
> >>
> >>
> >> On 3/24/14, 3:16 PM, Diana Carroll wrote:
> >>
> >> Thanks Ongen.
> >>
> >> Unfortunately I'm not able to follow your instructions either.  In
> >> particular:
> >>>
> >>>
> >>> sbt compile
> >>> sbt run 
> >>
> >>
> >> This doesn't work for me because there's no program on my path called
> >> "sbt".  The instructions in the Quick Start guide are specific that I
> sho
>
>


Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
Partition your input so that each partition holds an even number of elements.

Then use mapPartitions to operate on each partition's Iterator[Int].

There may be more efficient ways….
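
For instance, a minimal sketch of that approach (assuming every partition ends up
holding an even number of elements, so no pair straddles a partition boundary):

val rdd = sc.parallelize(1 to 8, 2)    // two partitions: (1,2,3,4) and (5,6,7,8)
val pairs = rdd.mapPartitions(_.grouped(2).map(g => (g(0), g(1))))
pairs.collect                          // Array((1,2), (3,4), (5,6), (7,8))

Each pair can then be fed into whatever per-pair computation is needed.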

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:

> Hi, I have large data set of numbers ie RDD and wanted to perform a 
> computation only on groupof two values at a time. For example 
> 1,2,3,4,5,6,7... is an RDD Can i group the RDD into (1,2),(3,4),(5,6)...?? 
> and perform the respective computations ?in an efficient manner? As we do'nt 
> have a way to index elements directly using forloop etc..(i,i+1)...is their 
> way to resolve this problem? Please suggest me ..i would be really thankful 
> to you  
> View this message in context: Splitting RDD and Grouping together to perform 
> computation 
> (http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153.html)
> Sent from the Apache Spark User List mailing list archive 
> (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
> (http://Nabble.com).



Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Yes, actually even for Spark I mostly use the sbt I installed….. so I always 
missed this issue….

If you can reproduce the problem with the Spark-distributed sbt…I suggest 
proposing a PR to fix the document before 0.9.1 is officially released.

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 8:34 PM, Diana Carroll wrote:

> It is suggested implicitly in giving you the command "./sbt/sbt". The 
> separately installed sbt isn't in a folder called sbt, whereas Spark's 
> version is.  And more relevantly, just a few paragraphs earlier in the 
> tutorial you execute the command "sbt/sbt assembly" which definitely refers 
> to the spark install.  
>  
> On Monday, March 24, 2014, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > I found that I never read the document carefully and I never find that 
> > Spark document is suggesting you to use Spark-distributed sbt……  
> >  
> > Best,
> >  
> > --  
> > Nan Zhu
> >  
> >  
> >  
> > On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:
> >  
> > > Thanks for your help, everyone.  Several folks have explained that I can 
> > > surely solve the problem by installing sbt.
> > >  
> > > But I'm trying to get the instructions working as written on the Spark 
> > > website.  The instructions not only don't have you install sbt 
> > > separately...they actually specifically have you use the sbt that is 
> > > distributed with Spark.  
> > >  
> > > If it is not possible to build your own Spark programs with 
> > > Spark-distributed sbt, then that's a big hole in the Spark docs that I 
> > > shall file.  And if the sbt that is included with Spark is MEANT to be 
> > > able to compile your own Spark apps, then that's a product bug.  
> > >  
> > > But before I file the bug, I'm still hoping I'm missing something, and 
> > > someone will point out that I'm missing a small step that will make the 
> > > Spark distribution of sbt work!
> > >  
> > > Diana
> > >  
> > >  
> > >  
> > > On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska  
> > > wrote:
> > > > Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
> > > > (since like other folks I had sbt preinstalled on my "usual" machine)
> > > >  
> > > > I ran the command exactly as Ognen suggested and see
> > > > Set current project to Simple Project (do you see this -- you should
> > > > at least be seeing this)
> > > > and then a bunch of Resolving ...
> > > >  
> > > > messages. I did get an error there, saying it can't find
> > > > javax.servlet.orbit. I googled the error and found this thread:
> > > >  
> > > > http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
> > > >  
> > > > adding the IvyXML fragment they suggested helped in my case (but
> > > > again, the build pretty clearly complained).
> > > >  
> > > > If you're still having no luck, I suggest installing sbt and setting
> > > > SBT_HOME... http://www.scala-sbt.org/
> > > >  
> > > > In either case though, it's not a Spark-specific issue...Hopefully
> > > > some of all this helps.
> > > >  
> > > > On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll  
> > > > wrote:
> > > > > Yeah, that's exactly what I did. Unfortunately it doesn't work:
> > > > >
> > > > > $SPARK_HOME/sbt/sbt package
> > > > > awk: cmd. line:1: fatal: cannot open file 
> > > > > `./project/build.properties' for
> > > > > reading (No such file or directory)
> > > > > Attempting to fetch sbt
> > > > > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > > > > directory
> > > > > /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
> > > > > directory
> > > > > Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. 
> > > > > Please
> > > > > install sbt manually from http://www.scala-sbt.org/
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
> > > > >  wrote:
> > > > >>
> > > > >> You can use any sbt on your machine, including the one that comes 
> > > > >> with
> > > > >> spark. For example, try:
> > > > >>
> > > > >> ~/path_to_spark/sbt/sbt compile
> > > > >> ~/path_to_spark/sbt/sbt run 
> > > > >>
> > > > >> Or you can just add that to your PATH by:
> > > > >>
> > > > >> export $PATH=$PATH:~/path_to_spark/sbt
> > > > >>
> > > > >> To make it permanent, you can add it to your ~/.bashrc or 
> > > > >> ~/.bash_profile
> > > > >> or ??? depending on the system you are using. If you are on Windows, 
> > > > >> sorry,
> > > > >> I can't offer any help there ;)
> > > > >>
> > > > >> Ognen
> > > > >>
> > > > >>
> > > > >> On 3/24/14, 3:16 PM, Diana Carroll wrote:
> > > > >>
> > > > >> Thanks Ongen.
> > > > >>
> > > > >> Unfortunately I'm not able to follow your instructions either.  In
> > > > >> particular:
> > > > >>>
> > > > >>>
> > > > >>> sbt compile
> > > > >>> sbt run 
> > > > >>
> > > > >>
> > > > >> This doesn't work for me because there's no program on my path called
> > > > >> "sbt".  The instructio

Re: Writing RDDs to HDFS

2014-03-24 Thread Yana Kadiyska
Ognen, can you comment if you were actually able to run two jobs
concurrently with just restricting spark.cores.max? I run Shark on the
same cluster and was not able to see a standalone job get in (since
Shark is a "long running" job) until I restricted both spark.cores.max
_and_ spark.executor.memory. Just curious if I did something wrong.
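
For reference, a rough sketch (not from this thread) of capping both properties when
creating the context; the master URL, app name, and values below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")      // placeholder standalone master URL
  .setAppName("capped-job")              // placeholder app name
  .set("spark.cores.max", "8")           // cap the total cores this app may take
  .set("spark.executor.memory", "4g")    // cap the memory requested per executor
val sc = new SparkContext(conf)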

On Mon, Mar 24, 2014 at 7:48 PM, Ognen Duzlevski
 wrote:
> Just so I can close this thread (in case anyone else runs into this stuff) -
> I did sleep through the basics of Spark ;). The answer on why my job is in
> waiting state (hanging) is here:
> http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
>
>
> Ognen
>
> On 3/24/14, 5:01 PM, Diana Carroll wrote:
>
> Ongen:
>
> I don't know why your process is hanging, sorry.  But I do know that the way
> saveAsTextFile works is that you give it a path to a directory, not a file.
> The "file" is saved in multiple parts, corresponding to the partitions.
> (part-0, part-1 etc.)
>
> (Presumably it does this because it allows each partition to be saved on the
> local disk, to minimize network traffic.  It's how Hadoop works, too.)
>
>
>
>
> On Mon, Mar 24, 2014 at 5:00 PM, Ognen Duzlevski 
> wrote:
>>
>> Is someRDD.saveAsTextFile("hdfs://ip:port/path/final_filename.txt")
>> supposed to work? Meaning, can I save files to the HDFS fs this way?
>>
>> I tried:
>>
>> val r = sc.parallelize(List(1,2,3,4,5,6,7,8))
>> r.saveAsTextFile("hdfs://ip:port/path/file.txt")
>>
>> and it is just hanging. At the same time on my HDFS it created file.txt
>> but as a directory which has subdirectories (the final one is empty).
>>
>> Thanks!
>> Ognen
>
>
>
> --
> "A distributed system is one in which the failure of a computer you didn't
> even know existed can render your own computer unusable"
> -- Leslie Lamport


Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
I didn't group the integers, but processed them in groups of two.

First, partition the data:

scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)  
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at 
:12


Then process each partition, handling its elements in groups of 2 (ListBuffer needs to be
imported first):

scala> import scala.collection.mutable.ListBuffer
import scala.collection.mutable.ListBuffer

scala> a.mapPartitions(p => {val l = p.toList;
 | val ret = new ListBuffer[Int]
 | for (i <- 0 until l.length by 2) {
 | ret += l(i) + l(i + 1)
 | }
 | ret.toList.iterator
 | }
 | )
res7: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at mapPartitions at 
:16



scala> res7.collect

res10: Array[Int] = Array(3, 7)

Best,

--  
Nan Zhu



On Monday, March 24, 2014 at 8:40 PM, Nan Zhu wrote:

> partition your input into even number partitions  
>  
> use mapPartition to operate on Iterator[Int]
>  
> maybe there are some more efficient way….
>  
> Best,  
>  
> --  
> Nan Zhu
>  
>  
>  
> On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:
>  
> > Hi, I have large data set of numbers ie RDD and wanted to perform a 
> > computation only on groupof two values at a time. For example 
> > 1,2,3,4,5,6,7... is an RDD Can i group the RDD into (1,2),(3,4),(5,6)...?? 
> > and perform the respective computations ?in an efficient manner? As we 
> > do'nt have a way to index elements directly using forloop etc..(i,i+1)...is 
> > their way to resolve this problem? Please suggest me ..i would be really 
> > thankful to you  
> > View this message in context: Splitting RDD and Grouping together to 
> > perform computation 
> > (http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153.html)
> > Sent from the Apache Spark User List mailing list archive 
> > (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
> > (http://Nabble.com).
>  



Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks again.

> If you use the KMeans implementation from MLlib, the
> initialization stage is done on master, 

The “master” here is the app/driver/spark-shell?

Thanks!

On 25 Mar, 2014, at 1:03 am, Xiangrui Meng  wrote:

> Number of rows doesn't matter much as long as you have enough workers
> to distribute the work. K-means has complexity O(n * d * k), where n
> is number of points, d is the dimension, and k is the number of
> clusters. If you use the KMeans implementation from MLlib, the
> initialization stage is done on master, so a large k would slow down
> the initialization stage. If your data is sparse, the latest change to
> KMeans will help with the speed, depending on how sparse your data is.
> -Xiangrui
> 
> On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming  wrote:
>> Thanks, Let me try with a smaller K.
>> 
>> Does the size of the input data matters for the example? Currently I have 
>> 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>> 
>> 
>> 
>> 
>> 
>> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng  wrote:
>> 
>>> K = 50 is certainly a large number for k-means. If there is no
>>> particular reason to have 50 clusters, could you try to reduce it
>>> to, e.g, 100 or 1000? Also, the example code is not for large-scale
>>> problems. You should use the KMeans algorithm in mllib clustering for
>>> your problem.
>>> 
>>> -Xiangrui
>>> 
>>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming  
>>> wrote:
 Hi,
 
 This is on a 4 nodes cluster each with 32 cores/256GB Ram.
 
 (0.9.0) is deployed in a stand alone mode.
 
 Each worker is configured with 192GB. Spark executor memory is also 192GB.
 
 This is on the first iteration. K=50. Here's the code I use:
 http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
 
 Thanks!
 
 
 
 On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng  wrote:
 
> Hi Tsai,
> 
> Could you share more information about the machine you used and the
> training parameters (runs, k, and iterations)? It can help solve your
> issues. Thanks!
> 
> Best,
> Xiangrui
> 
> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  
> wrote:
>> Hi,
>> 
>> At the reduceBuyKey stage, it takes a few minutes before the tasks start 
>> working.
>> 
>> I have -Dspark.default.parallelism=127 cores (n-1).
>> 
>> CPU/Network/IO is idling across all nodes when this is happening.
>> 
>> And there is nothing particular on the master log file. From the 
>> spark-shell:
>> 
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 
>> on executor 2: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 
>> 38765155 bytes in 193 ms
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 
>> on executor 1: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 
>> 38765155 bytes in 96 ms
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 
>> on executor 0: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 
>> 38765155 bytes in 100 ms
>> 
>> But it stops there for some significant time before any movement.
>> 
>> In the stage detail of the UI, I can see that there are 127 tasks 
>> running but the duration each is at least a few minutes.
>> 
>> I'm working off local storage (not hdfs) and the kmeans data is about 
>> 6.5GB (50M rows).
>> 
>> Is this a normal behaviour?
>> 
>> Thanks!
 
>> 



Re: RDD usage

2014-03-24 Thread hequn cheng
points.foreach(p=>p.y = another_value) will return a new modified RDD.


2014-03-24 18:13 GMT+08:00 Chieh-Yen :

> Dear all,
>
> I have a question about the usage of RDD.
> I implemented a class called AppDataPoint, it looks like:
>
> case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends
> Serializable {
>   var y : Double = input_y
>   var x : Array[Double] = input_x
>   ..
> }
> Furthermore, I created the RDD by the following function.
>
> def parsePoint(line: String): AppDataPoint = {
>   /* Some related works for parsing */
>   ..
> }
>
> Assume the RDD called "points":
>
> val lines = sc.textFile(inputPath, numPartition)
> var points = lines.map(parsePoint _).cache()
>
> The question is that, I tried to modify the value of this RDD, the
> operation is:
>
> points.foreach(p=>p.y = another_value)
>
> The operation works.
> There is no warning or error message shown by the system, and
> the results are right.
> I wonder whether modifying the RDD this way is a correct and in fact
> workable design.
> The docs say that the RDD is immutable; is there any suggestion?
>
> Thanks a lot.
>
> Chieh-Yen Lin
>


Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, a related question: does that mean the memory usage will be doubled? And 
furthermore, if the compute function in an RDD is not idempotent, the RDD will 
change while the job is running, is that right? 

-Original Message-
From: "hequn cheng" 
Sent: 2014/3/25 9:35
To: "user@spark.apache.org" 
Subject: Re: RDD usage

points.foreach(p=>p.y = another_value) will return a new modified RDD. 




2014-03-24 18:13 GMT+08:00 Chieh-Yen :

Dear all,


I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:


case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends 
Serializable {
  var y : Double = input_y
  var x : Array[Double] = input_x
  ..
}
Furthermore, I created the RDD by the following function.


def parsePoint(line: String): AppDataPoint = {
  /* Some related works for parsing */
  ..
}


Assume the RDD called "points":


val lines = sc.textFile(inputPath, numPartition)
var points = lines.map(parsePoint _).cache()


The question is that, I tried to modify the value of this RDD, the operation is:


points.foreach(p=>p.y = another_value)


The operation is workable.
There doesn't have any warning or error message showed by the system and the 
results are right.
I wonder that if the modification for RDD is a correct and in fact workable 
design.
The usage web said that the RDD is immutable, is there any suggestion?


Thanks a lot.


Chieh-Yen Lin

Re: RDD usage

2014-03-24 Thread Mark Hamstra
No, it won't.  The type of RDD#foreach is Unit, so it doesn't return an
RDD.  The utility of foreach is purely for the side effects it generates,
not for its return value -- and modifying an RDD in place via foreach is
generally not a very good idea.
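
A sketch of the usual alternative, reusing the AppDataPoint case class and the
points / another_value names from this thread: derive a new RDD with map instead of
mutating objects in place.

// Build new points instead of mutating the cached ones via foreach.
val updatedPoints = points.map(p => AppDataPoint(another_value, p.x))
updatedPoints.cache()   // cache the derived RDD if it will be reused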


On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng  wrote:

> points.foreach(p=>p.y = another_value) will return a new modified RDD.
>
>
> 2014-03-24 18:13 GMT+08:00 Chieh-Yen :
>
>  Dear all,
>>
>> I have a question about the usage of RDD.
>> I implemented a class called AppDataPoint, it looks like:
>>
>> case class AppDataPoint(input_y : Double, input_x : Array[Double])
>> extends Serializable {
>>   var y : Double = input_y
>>   var x : Array[Double] = input_x
>>   ..
>> }
>> Furthermore, I created the RDD by the following function.
>>
>> def parsePoint(line: String): AppDataPoint = {
>>   /* Some related works for parsing */
>>   ..
>> }
>>
>> Assume the RDD called "points":
>>
>> val lines = sc.textFile(inputPath, numPartition)
>> var points = lines.map(parsePoint _).cache()
>>
>> The question is that, I tried to modify the value of this RDD, the
>> operation is:
>>
>> points.foreach(p=>p.y = another_value)
>>
>> The operation is workable.
>> There doesn't have any warning or error message showed by the system and
>> the results are right.
>> I wonder that if the modification for RDD is a correct and in fact
>> workable design.
>> The usage web said that the RDD is immutable, is there any suggestion?
>>
>> Thanks a lot.
>>
>> Chieh-Yen Lin
>>
>
>


Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui

On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming  wrote:
> Thanks again.
>
>> If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master,
>
> The "master" here is the app/driver/spark-shell?
>
> Thanks!
>
> On 25 Mar, 2014, at 1:03 am, Xiangrui Meng  wrote:
>
>> Number of rows doesn't matter much as long as you have enough workers
>> to distribute the work. K-means has complexity O(n * d * k), where n
>> is number of points, d is the dimension, and k is the number of
>> clusters. If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master, so a large k would slow down
>> the initialization stage. If your data is sparse, the latest change to
>> KMeans will help with the speed, depending on how sparse your data is.
>> -Xiangrui
>>
>> On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming  wrote:
>>> Thanks, Let me try with a smaller K.
>>>
>>> Does the size of the input data matters for the example? Currently I have 
>>> 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>>>
>>>
>>>
>>>
>>>
>>> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng  wrote:
>>>
 K = 50 is certainly a large number for k-means. If there is no
 particular reason to have 50 clusters, could you try to reduce it
 to, e.g, 100 or 1000? Also, the example code is not for large-scale
 problems. You should use the KMeans algorithm in mllib clustering for
 your problem.

 -Xiangrui

 On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming  
 wrote:
> Hi,
>
> This is on a 4 nodes cluster each with 32 cores/256GB Ram.
>
> (0.9.0) is deployed in a stand alone mode.
>
> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>
> This is on the first iteration. K=50. Here's the code I use:
> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>
> Thanks!
>
>
>
> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng  wrote:
>
>> Hi Tsai,
>>
>> Could you share more information about the machine you used and the
>> training parameters (runs, k, and iterations)? It can help solve your
>> issues. Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  
>> wrote:
>>> Hi,
>>>
>>> At the reduceBuyKey stage, it takes a few minutes before the tasks 
>>> start working.
>>>
>>> I have -Dspark.default.parallelism=127 cores (n-1).
>>>
>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>
>>> And there is nothing particular on the master log file. From the 
>>> spark-shell:
>>>
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 
>>> on executor 2: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 
>>> 38765155 bytes in 193 ms
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 
>>> on executor 1: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 
>>> 38765155 bytes in 96 ms
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 
>>> on executor 0: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 
>>> 38765155 bytes in 100 ms
>>>
>>> But it stops there for some significant time before any movement.
>>>
>>> In the stage detail of the UI, I can see that there are 127 tasks 
>>> running but the duration each is at least a few minutes.
>>>
>>> I'm working off local storage (not hdfs) and the kmeans data is about 
>>> 6.5GB (50M rows).
>>>
>>> Is this a normal behaviour?
>>>
>>> Thanks!
>
>>>
>


Re: Re: RDD usage

2014-03-24 Thread hequn cheng
First question:
If you save your modified RDD like this:
points.foreach(p=>p.y = another_value).collect() or
points.foreach(p=>p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized and this will not use any worker's
memory.
If you have more transformations after the map(), Spark will pipeline
all transformations and build a DAG. Very little memory will be used in
this stage and the memory will be freed soon.
Only cache() will persist your RDD in memory for a long time.
Second question:
Once an RDD is created, it cannot be changed, due to the immutability feature. You
can only create a new RDD from the existing RDD or from the file system.


2014-03-25 9:45 GMT+08:00 林武康 :

>  Hi hequn, a relative question, is that mean the memory usage will
> doubled? And further more, if the compute function in a rdd is not
> idempotent, rdd will changed during the job running, is that right?
>  --
> From: hequn cheng 
> Sent: 2014/3/25 9:35
> To: user@spark.apache.org
> Subject: Re: RDD usage
>
> points.foreach(p=>p.y = another_value) will return a new modified RDD.
>
>
> 2014-03-24 18:13 GMT+08:00 Chieh-Yen :
>
>>  Dear all,
>>
>> I have a question about the usage of RDD.
>> I implemented a class called AppDataPoint, it looks like:
>>
>> case class AppDataPoint(input_y : Double, input_x : Array[Double])
>> extends Serializable {
>>   var y : Double = input_y
>>   var x : Array[Double] = input_x
>>   ..
>> }
>> Furthermore, I created the RDD by the following function.
>>
>> def parsePoint(line: String): AppDataPoint = {
>>   /* Some related works for parsing */
>>   ..
>> }
>>
>> Assume the RDD called "points":
>>
>> val lines = sc.textFile(inputPath, numPartition)
>> var points = lines.map(parsePoint _).cache()
>>
>> The question is that, I tried to modify the value of this RDD, the
>> operation is:
>>
>> points.foreach(p=>p.y = another_value)
>>
>> The operation is workable.
>> There doesn't have any warning or error message showed by the system and
>> the results are right.
>> I wonder that if the modification for RDD is a correct and in fact
>> workable design.
>> The usage web said that the RDD is immutable, is there any suggestion?
>>
>> Thanks a lot.
>>
>> Chieh-Yen Lin
>>
>
>


Re: Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, I dug into the source of Spark a bit deeper and I got some ideas. Firstly, 
immutability is a feature of an RDD but not a solid rule; there are ways to change it, 
for example an RDD with a non-idempotent "compute" function, though it is really a 
bad design to make that function non-idempotent, given the uncontrollable side effects. 
I agree with Mark that foreach can modify the elements of an RDD, but we should avoid 
this because it will affect all the RDDs generated from the changed RDD, making the 
whole process inconsistent and unstable.

These are some rough opinions on the immutability of RDDs; a fuller discussion can make 
it clearer. Any ideas?

-Original Message-
From: "hequn cheng" 
Sent: 2014/3/25 10:40
To: "user@spark.apache.org" 
Subject: Re: Re: RDD usage

First question:
If you save your modified RDD like this:
points.foreach(p=>p.y = another_value).collect() or 
points.foreach(p=>p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized and this will not use any work's memory.
If you have more transformatins after the map(), the spark will pipelines all 
transformations and build a DAG. Very little memory will be used in this stage 
and the memory will be free soon.
Only cache() will persist your RDD in memory for a long time.
Second question:
Once RDD be created, it can not be changed due to the immutable feature.You can 
only create a new RDD from the existing RDD or from file system.



2014-03-25 9:45 GMT+08:00 林武康 :

Hi hequn, a relative question, is that mean the memory usage will doubled? And 
further more, if the compute function in a rdd is not idempotent, rdd will 
changed during the job running, is that right? 


From: hequn cheng
Sent: 2014/3/25 9:35
To: user@spark.apache.org
Subject: Re: RDD usage


points.foreach(p=>p.y = another_value) will return a new modified RDD. 




2014-03-24 18:13 GMT+08:00 Chieh-Yen :

Dear all,


I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:


case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends 
Serializable {
  var y : Double = input_y
  var x : Array[Double] = input_x
  ..
}
Furthermore, I created the RDD by the following function.


def parsePoint(line: String): AppDataPoint = {
  /* Some related works for parsing */
  ..
}


Assume the RDD called "points":


val lines = sc.textFile(inputPath, numPartition)
var points = lines.map(parsePoint _).cache()


The question is that, I tried to modify the value of this RDD, the operation is:


points.foreach(p=>p.y = another_value)


The operation is workable.
There doesn't have any warning or error message showed by the system and the 
results are right.
I wonder that if the modification for RDD is a correct and in fact workable 
design.
The usage web said that the RDD is immutable, is there any suggestion?


Thanks a lot.


Chieh-Yen Lin

Re: spark executor/driver log files management

2014-03-24 Thread Sourav Chandra
Hi TD,

I thought about that, but was not sure whether this would have any impact on the
Spark UI / ExecutorRunner, since it redirects the streams to stderr/stdout. But
ideally it should not, as it will fetch the log records from the stderr file
(which is the latest).

Is my understanding correct?

Thanks,
Sourav
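
For reference, a minimal log4j.properties sketch of log rotation, using the size-based
RollingFileAppender from core log4j rather than the extras time-based appender TD
linked; the file name, size, and backup count below are placeholders:

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/executor.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n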


On Tue, Mar 25, 2014 at 3:26 AM, Tathagata Das
wrote:

> You can use RollingFileAppenders in log4j.properties.
>
> http://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html
>
> You can have other scripts delete old logs.
>
> TD
>
>
> On Mon, Mar 24, 2014 at 12:20 AM, Sourav Chandra <
> sourav.chan...@livestream.com> wrote:
>
>> Hi,
>>
>> I have few questions regarding log file management in spark:
>>
>> 1. Currently I did not find any way to modify the lof file name for
>> executor/drivers). Its hardcoded as stdout and stderr. Also there is no log
>> rotation.
>>
>> In case of streaming application this will grow forever and become
>> unmanageable. Is there any way to overcome this?
>>
>> Thanks,
>> --
>>
>> Sourav Chandra
>>
>> Senior Software Engineer
>>
>> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>>
>> sourav.chan...@livestream.com
>>
>> o: +91 80 4121 8723
>>
>> m: +91 988 699 3746
>>
>> skype: sourav.chandra
>>
>> Livestream
>>
>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
>> Block, Koramangala Industrial Area,
>>
>> Bangalore 560034
>>
>> www.livestream.com
>>
>
>


-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: coalescing RDD into equally sized partitions

2014-03-24 Thread Matei Zaharia
This happened because they were integers equal to 0 mod 5, and we used the 
default hashCode implementation for integers, which will map them all to 0. 
There’s no API method that will look at the resulting partition sizes and 
rebalance them, but you could use another hash function.
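
For example, a rough sketch of one way to spread out these particular values (keying by
x / 20 is ad hoc for this toy data, not a general rebalancing API):

import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)
// Key each value so the keys 0..4 hash to five distinct partitions, then drop the keys.
val balanced = rdd.map(x => (x / 20, x)).partitionBy(new HashPartitioner(5)).values
balanced.glom.collect   // Array(Array(0), Array(20), Array(40), Array(60), Array(80))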

Matei

On Mar 24, 2014, at 5:20 PM, Walrus theCat  wrote:

> Hi,
> 
> sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0 
> ).coalesce(5,true).glom.collect  yields
> 
> Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), 
> Array(), Array())
> 
> How do I get something more like:
> 
>  Array(Array(0), Array(20), Array(40), Array(60), Array(80))
> 
> Thanks



Spark Streaming ZeroMQ Java Example

2014-03-24 Thread goofy real
Is there a ZeroMQWordCount Java sample code?

https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala


Re: distinct in data frame in spark

2014-03-24 Thread Andrew Ash
My thought would be to key by the first item in each array, then take just
one array for each key.  Something like the below:

val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
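
With the sample data, output.collect() should then contain one Seq per distinct leading
element (e.g. Seq(1,2,3,4) and Seq(2,3,4,5)); which of the two duplicate rows survives
depends on the order in which reduceByKey sees them.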


On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu  wrote:

> Hi,
>   I have a very simple use case:
>
> I have an rdd as following:
>
> d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
>
> Now, I want to remove all the duplicates from a column and return the
> remaining frame..
> For example:
> If I want to remove the duplicates based on column 1,
> then basically I would remove either row 1 or row 2 in my final result,
> because column 1 of both the first and second rows is the same element (1)
> and hence the duplicate.
> So, a possible result is:
>
> output = [[1,2,3,4],[2,3,4,5]]
>
> How do I do this in spark?
> Thanks
>