Re: [External] Re: no stdout output from worker

2014-03-10 Thread Sourav Chandra
Hi Ranjan,

Whatever code is passed as a closure to Spark operations like map,
flatMap, filter, etc. is part of a task.

Everything else runs in the driver.
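
For example, a rough sketch (class and variable names made up) of where each
println ends up:

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class DriverVsTask {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "DriverVsTask");
    JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "bb", "ccc"));

    // Driver side: prints on the console where you launched the program.
    System.out.println("driver: about to run a map");

    JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
      public Integer call(String s) {
        // Task side: on a standalone cluster this goes to the executor's
        // stdout file (spark/work/<app-id>/<executor-id>/stdout) on the
        // worker node; in local mode it shows up on the same console.
        System.out.println("task: processing " + s);
        return s.length();
      }
    });

    // Driver side again.
    System.out.println("driver: total = " + lengths.count());
  }
}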

Thanks,
Sourav


On Mon, Mar 10, 2014 at 12:03 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:

 Hi Patrick

 How do I know which part of the code is in the driver and which in task?
 The structure of my code is as below-

 …

 static boolean done = false;
 …

 public static void main(..

 ..

 JavaRDD<String> lines = ..

 ..

 while (!done) {

 ..
 while (..) {

 JavaPairRDD<Integer, List<Integer>> labs1 = labs.map(new PairFunction… );

 !! Here I have System.out.println (A)

 } // inner while

 !! Here I have System.out.println (B)


 if (…) {
 done = true;

 !! Also here some System.out.println (C)

 break;
 }

 else {

 if (…) {

 !! More System.out.println (D)


 labs = labs.map(…);

 }
 }

 } // outer while

 !! Even more System.out.println (E)

 } // main

 } //class

 I get the console outputs on the master for (B) and (E). I do not see any
 stdout in the worker node. I find the stdout and stderr in the
 spark/work/appid/0/. I see output
 in stderr but not in stdout.

 I do get all the outputs on the console when I run it in local mode.

 Sorry I am new and may be asking some naïve question but it is really
 confusing to me. Thanks for your help.

 Ranjan

 On 3/9/14, 10:50 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Sen,
 
 Is your code in the driver code or inside one of the tasks?
 
 If it's in the tasks, the place you would expect these to be is in
 stdout file under spark/appid/work/[stdout/stderr]. Are you seeing
 at least stderr logs in that folder? If not then the tasks might not
 be running on the worker machines. If you see stderr but not stdout
 that's a bit of a puzzler since they both go through the same
 mechanism.
 
 - Patrick
 
 On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com
 wrote:
  Hi
  I have some System.out.println in my Java code that is working ok in a
 local
  environment. But when I run the same code on a standalone  mode in a EC2
  cluster I do not see them at the worker stdout (in the worker node under
  spark location/work ) or at the driver console. Could you help me
  understand how do I troubleshoot?
 
  Thanks
  Ranjan




-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: Streaming JSON string from REST Api in Spring

2014-03-10 Thread sonyjv
Thanks Mayur for your clarification.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-JSON-string-from-REST-Api-in-Spring-tp2358p2451.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


subscribe

2014-03-10 Thread hequn cheng
hi


subscribe

2014-03-10 Thread hequn cheng
hi


Re: subscribe

2014-03-10 Thread He-chien Tsai
send this to 'user-request', not 'user'


2014-03-10 17:32 GMT+08:00 hequn cheng chenghe...@gmail.com:

 hi



Using flume to create stream for spark streaming.

2014-03-10 Thread Ravi Hemnani
Hey,

I am using the following Flume flow:

Flume agent 1: RabbitMQ source, file channel, Avro sink, sending data to a
slave node of the Spark cluster.
Flume agent 2 (on that slave node of the Spark cluster): Avro source, file
channel. For the sink I tried avro, hdfs and file_roll, but I am not able to
read the DStream from any of these. For the avro sink type, I set the sink
address to the same slave node and some other port, and I point the Spark
Streaming program at that slave node and the port defined in the sink
configuration on the slave node. Spark Streaming gives me no result.

I am running the program as java -jar <jar> on the master of the cluster.

What sink type should be used on the slave node?
I have been stuck on this for two weeks now and am confused about how to
approach this.

Any help?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-flume-to-create-stream-for-spark-streaming-tp2457.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: [External] Re: no stdout output from worker

2014-03-10 Thread Sen, Ranjan [USA]
Hi Sourav
That makes so much sense. Thanks much.
Ranjan

From: Sourav Chandra sourav.chan...@livestream.com
Reply-To: user@spark.apache.org
Date: Sunday, March 9, 2014 at 10:37 PM
To: user@spark.apache.org
Subject: Re: [External] Re: no stdout output from worker

Hi Ranjan,

Whatever code is passed as a closure to Spark operations like map, flatMap,
filter, etc. is part of a task.

Everything else runs in the driver.

Thanks,
Sourav


On Mon, Mar 10, 2014 at 12:03 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:
Hi Patrick

How do I know which part of the code is in the driver and which in task?
The structure of my code is as below-

…

static boolean done = false;
…

public static void main(..

..

JavaRDD<String> lines = ..

..

while (!done) {

..
while (..) {

JavaPairRDD<Integer, List<Integer>> labs1 = labs.map(new PairFunction… );

!! Here I have System.out.println (A)

} // inner while

!! Here I have System.out.println (B)


if (…) {
done = true;

!! Also here some System.out.println (C)

break;
}

else {

if (…) {

!! More System.out.println (D)


labs = labs.map(…);

}
}

} // outer while

!! Even more System.out.println (E)

} // main

} //class

I get the console outputs on the master for (B) and (E). I do not see any
stdout in the worker node. I find the stdout and stderr in the
spark/work/appid/0/. I see output
in stderr but not in stdout.

I do get all the outputs on the console when I run it in local mode.

Sorry I am new and may be asking some naïve question but it is really
confusing to me. Thanks for your help.

Ranjan

On 3/9/14, 10:50 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Sen,

Is your code in the driver code or inside one of the tasks?

If it's in the tasks, the place you would expect these to be is in
stdout file under spark/appid/work/[stdout/stderr]. Are you seeing
at least stderr logs in that folder? If not then the tasks might not
be running on the worker machines. If you see stderr but not stdout
that's a bit of a puzzler since they both go through the same
mechanism.

- Patrick

On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com
wrote:
 Hi
 I have some System.out.println in my Java code that is working ok in a
local
 environment. But when I run the same code on a standalone  mode in a EC2
 cluster I do not see them at the worker stdout (in the worker node under
 spark location/work ) or at the driver console. Could you help me
 understand how do I troubleshoot?

 Thanks
 Ranjan




--

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·


sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd Block, 
Koramangala Industrial Area,

Bangalore 560034

www.livestream.com



Log Analyze

2014-03-10 Thread Eduardo Costa Alfaia

Hi Guys,
Could anyone help me understand this piece of the log (the part highlighted
in red)? Why is this happening?


Thanks

14/03/10 16:55:20 INFO SparkContext: Starting job: first at 
NetworkWordCount.scala:87
14/03/10 16:55:20 INFO JobScheduler: Finished job streaming job 
1394466892000 ms.0 from job set of time 1394466892000 ms
14/03/10 16:55:20 INFO JobScheduler: Total delay: 28.537 s for time 
1394466892000 ms (execution: 4.479 s)
14/03/10 16:55:20 INFO JobScheduler: Starting job streaming job 
1394466893000 ms.0 from job set of time 1394466893000 ms
14/03/10 16:55:20 INFO JobGenerator: Checkpointing graph for time 
1394466892000 ms
14/03/10 16:55:20 INFO DStreamGraph: Updating checkpoint data for time 
1394466892000 ms
14/03/10 16:55:20 INFO DStreamGraph: Updated checkpoint data for time 
1394466892000 ms
14/03/10 16:55:20 INFO CheckpointWriter: Saving checkpoint for time 
1394466892000 ms to file 
'hdfs://computer8:54310/user/root/INPUT/checkpoint-1394466892000'
14/03/10 16:55:20 INFO DAGScheduler: Registering RDD 496 (combineByKey 
at ShuffledDStream.scala:42)
14/03/10 16:55:20 INFO DAGScheduler: Got job 39 (first at 
NetworkWordCount.scala:87) with 1 output partitions (allowLocal=true)
14/03/10 16:55:20 INFO DAGScheduler: Final stage: Stage 77 (first at 
NetworkWordCount.scala:87)

14/03/10 16:55:20 INFO DAGScheduler: Parents of final stage: List(Stage 78)
14/03/10 16:55:20 INFO DAGScheduler: Missing parents: List(Stage 78)
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed 
input-1-1394466782400 on computer10.ant-net:34062 in memory (size: 5.9 
MB, free: 502.2 MB)
14/03/10 16:55:20 INFO DAGScheduler: Submitting Stage 78 
(MapPartitionsRDD[496] at combineByKey at ShuffledDStream.scala:42), 
which has no missing parents
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-1-1394466816600 in memory on computer10.ant-net:34062 (size: 4.4 
MB, free: 497.8 MB)
14/03/10 16:55:20 INFO DAGScheduler: Submitting 15 missing tasks from 
Stage 78 (MapPartitionsRDD[496] at combineByKey at ShuffledDStream.scala:42)

14/03/10 16:55:20 INFO TaskSchedulerImpl: Adding task set 78.0 with 15 tasks
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:9 as TID 539 
on executor 2: computer1.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:9 as 4144 
bytes in 1 ms
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:10 as TID 540 
on executor 1: computer10.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:10 as 4144 
bytes in 0 ms
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:11 as TID 541 
on executor 0: computer11.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:11 as 4144 
bytes in 0 ms
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed 
input-0-1394466874200 on computer1.ant-net:51406 in memory (size: 2.9 
MB, free: 460.0 MB)
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed 
input-0-1394466874400 on computer1.ant-net:51406 in memory (size: 4.1 
MB, free: 468.2 MB)
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:12 as TID 542 
on executor 1: computer10.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:12 as 4144 
bytes in 1 ms

14/03/10 16:55:20 WARN TaskSetManager: Lost TID 540 (task 78.0:10)
14/03/10 16:55:20 INFO CheckpointWriter: Deleting 
hdfs://computer8:54310/user/root/INPUT/checkpoint-1394466892000
14/03/10 16:55:20 INFO CheckpointWriter: Checkpoint for time 
1394466892000 ms saved to file 
'hdfs://computer8:54310/user/root/INPUT/checkpoint-1394466892000', took 
3633 bytes and 93 ms
14/03/10 16:55:20 INFO DStreamGraph: Clearing checkpoint data for time 
1394466892000 ms
14/03/10 16:55:20 INFO DStreamGraph: Cleared checkpoint data for time 
1394466892000 ms
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed 
input-2-1394466789000 on computer11.ant-net:58332 in memory (size: 3.9 
MB, free: 536.0 MB)

14/03/10 16:55:20 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block 
input-2-1394466794200 not found

at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:45)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at 

Unsubscribe

2014-03-10 Thread arjun biswas



Room for rent in Aptos

2014-03-10 Thread arjun biswas
Hello ,

My name is Arjun and i am 30 years old and I was inquiring about the room
ad that you have put up on craigslist in Aptos. I am very much interested
in the room and can move in pretty early . My annual income is around 105K
and I am a software engineer working in the silicon valley for about three
years now . I have a masters in Computer Science from UCI . I have
generally being a resident of the Santa Cruz area for some time now even
though i work in Palo Alto . I guess that tells something about my love for
Santa Cruz .
Please reply me at your earliest . I can also be reached at 6505759206.

Regards
Arjun


Re: Sbt Permgen

2014-03-10 Thread Koert Kuipers
hey sandy, i think that pull request is not relevant to the 0.9 branch i am using

switching to java 7 for sbt/sbt test made it work. not sure why...


On Sun, Mar 9, 2014 at 11:44 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 There was an issue related to this fixed recently:
 https://github.com/apache/spark/pull/103


 On Sun, Mar 9, 2014 at 8:40 PM, Koert Kuipers ko...@tresata.com wrote:

 edit last line of sbt/sbt, after which i run:
 sbt/sbt test


 On Sun, Mar 9, 2014 at 10:24 PM, Sean Owen so...@cloudera.com wrote:

 How are you specifying these args?
 On Mar 9, 2014 8:55 PM, Koert Kuipers ko...@tresata.com wrote:

 i just checked out the latest 0.9

 no matter what java options i use in sbt/sbt (i tried -Xmx6G
 -XX:MaxPermSize=2000m -XX:ReservedCodeCacheSize=300m) i keep getting errors
 java.lang.OutOfMemoryError: PermGen space when running the tests.

 curiously i managed to run the tests with the default dependencies, but
 with cdh4.5.0 mr1 dependencies i always hit the dreaded Permgen space 
 issue.

 Any suggestions?






RE: Pig on Spark

2014-03-10 Thread Sameer Tilak
Hi Mayur,
We are planning to upgrade our distribution from MR1 to MR2 (YARN) and the
goal is to get Spork set up next month. I will keep you posted. Can you please
keep me informed about your progress as well?
From: mayur.rust...@gmail.com
Date: Mon, 10 Mar 2014 11:47:56 -0700
Subject: Re: Pig on Spark
To: user@spark.apache.org

Hi Sameer,
Did you make any progress on this? My team is also trying it out and would
love to know some details of your progress.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi





On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote:





Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com


To: user@spark.apache.org; tgraves...@yahoo.com

There is some work to make this work on yarn at 
https://github.com/aniket486/pig. (So, compile pig with ant -Dhadoopversion=23)


You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find 
out what sort of env variables you need (sorry, I haven't been able to clean 
this up- in-progress). There are few known issues with this, I will work on 
fixing them soon.



Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off schema-tuple-backend (should be a pig-jira)
3. Algebraic udfs don't work (spork-fix in-progress)
4. Group by rework (to avoid OOMs)
5. UDF Classloader issue (requires SPARK-1053, then you can put
   pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)

~Aniket






On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:


I had asked a similar question on the dev mailing list a while back (Jan 22nd). 



See the archives: 
http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look 
for spork.






Basically Matei said:



Yup, that was it, though I believe people at Twitter picked it up again 
recently. I’d suggest
asking Dmitriy if you know him. I’ve seen interest in this from several other 
groups, and
if there’s enough of it, maybe we can start another open source repo to track 
it. The work
in that repo you pointed to was done over one week, and already had most of 
Pig’s operators
working. (I helped out with this prototype over Twitter’s hack week.) That work 
also calls
the Scala API directly, because it was done before we had a Java API; it should 
be easier
with the Java one.
Tom 
 
 


On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:






Hi everyone,



We are using Pig to build our data pipeline. I came across Spork -- Pig on 
Spark at: https://github.com/dvryaboy/pig and not sure if it is still active.   


Can someone please let me know the status of Spork or any other effort that 
will let us run Pig on Spark? We can significantly benefit by using Spark, but 
we would like to keep using the existing Pig scripts.   
   





  

-- 
...:::Aniket:::... Quetzalco@tl
  

  

Re: [BLOG] Spark on Cassandra w/ Calliope

2014-03-10 Thread Rohit Rai
We are happy that you found Calliope useful and glad we could help.

Founder & CEO, Tuplejump, Inc.

www.tuplejump.com
The Data Engineering Platform


On Sat, Mar 8, 2014 at 2:18 AM, Brian O'Neill b...@alumni.brown.edu wrote:


 FWIW - I posted some notes to help people get started quickly with Spark
 on C*.
 http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

 (tnx again to Rohit and team for all of their help)

 -brian

 --
 Brian ONeill
 CTO, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42



Java example of using broadcast

2014-03-10 Thread Sen, Ranjan [USA]
Hi Patrick

Yes I get it.

I have a different question now - (changed the sub)

Can anyone point me to a Java example of using broadcast variables?

- Ranjan
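
For what it's worth, a minimal sketch of what that could look like with the
Java API (class name and data are made up):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "BroadcastExample");

    // Broadcast a read-only lookup list once to all executors instead of
    // shipping it inside every task closure.
    final Broadcast<List<Integer>> allowed = sc.broadcast(Arrays.asList(1, 2, 3));

    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    JavaRDD<Boolean> flags = nums.map(new Function<Integer, Boolean>() {
      public Boolean call(Integer x) {
        // Inside the task, read the broadcast value with value().
        return allowed.value().contains(x);
      }
    });

    System.out.println(flags.collect());
  }
}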

From: Patrick Wendell pwend...@gmail.com
Reply-To: user@spark.apache.org
Date: Monday, March 10, 2014 at 1:24 PM
To: user@spark.apache.org
Subject: Re: [External] Re: no stdout output from worker

Hey Sen,

Sourav is right, and I think all of your print statements are inside of the 
driver program rather than inside of a closure. How are you running your 
program (i.e. what do you run that starts this job)? Where you run the driver 
you should expect to see the output.

- Patrick


On Mon, Mar 10, 2014 at 8:56 AM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:
Hi Sourav
That makes so much sense. Thanks much.
Ranjan

From: Sourav Chandra sourav.chan...@livestream.com
Reply-To: user@spark.apache.org
Date: Sunday, March 9, 2014 at 10:37 PM
To: user@spark.apache.org
Subject: Re: [External] Re: no stdout output from worker

Hi Ranjan,

Whatever code is passed as a closure to Spark operations like map, flatMap,
filter, etc. is part of a task.

Everything else runs in the driver.

Thanks,
Sourav


On Mon, Mar 10, 2014 at 12:03 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:
Hi Patrick

How do I know which part of the code is in the driver and which in task?
The structure of my code is as below-

…

static boolean done = false;
…

public static void main(..

..

JavaRDD<String> lines = ..

..

while (!done) {

..
while (..) {

JavaPairRDD<Integer, List<Integer>> labs1 = labs.map(new PairFunction… );

!! Here I have System.out.println (A)

} // inner while

!! Here I have System.out.println (B)


if (…) {
done = true;

!! Also here some System.out.println (C)

break;
}

else {

if (…) {

!! More System.out.println (D)


labs = labs.map(…);

}
}

} // outer while

!! Even more System.out.println (E)

} // main

} //class

I get the console outputs on the master for (B) and (E). I do not see any
stdout in the worker node. I find the stdout and stderr in the
spark/work/appid/0/. I see output
in stderr but not in stdout.

I do get all the outputs on the console when I run it in local mode.

Sorry I am new and may be asking some naïve question but it is really
confusing to me. Thanks for your help.

Ranjan

On 3/9/14, 10:50 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Sen,

Is your code in the driver code or inside one of the tasks?

If it's in the tasks, the place you would expect these to be is in
stdout file under spark/appid/work/[stdout/stderr]. Are you seeing
at least stderr logs in that folder? If not then the tasks might not
be running on the worker machines. If you see stderr but not stdout
that's a bit of a puzzler since they both go through the same
mechanism.

- Patrick

On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com
wrote:
 Hi
 I have some System.out.println in my Java code that is working ok in a
local
 environment. But when I run the same code on a standalone  mode in a EC2
 cluster I do not see them at the worker stdout (in the worker node under
 spark location/work ) or at the driver console. Could you help me
 understand how do I troubleshoot?

 Thanks
 Ranjan




--

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·


sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd Block, 
Koramangala Industrial Area,

Bangalore 560034

www.livestream.com




Re: [External] Re: no stdout output from worker

2014-03-10 Thread Patrick Wendell
Hey Sen,

Sourav is right, and I think all of your print statements are inside of the
driver program rather than inside of a closure. How are you running your
program (i.e. what do you run that starts this job)? Where you run the
driver you should expect to see the output.

- Patrick


On Mon, Mar 10, 2014 at 8:56 AM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:

  Hi Sourav
 That makes so much sense. Thanks much.
 Ranjan

   From: Sourav Chandra sourav.chan...@livestream.com
 Reply-To: user@spark.apache.org user@spark.apache.org
 Date: Sunday, March 9, 2014 at 10:37 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Re: [External] Re: no stdout output from worker

   Hi Ranjan,

   Whatever code is passed as a closure to Spark operations like map,
  flatMap, filter, etc. is part of a task.

   Everything else runs in the driver.

  Thanks,
 Sourav


 On Mon, Mar 10, 2014 at 12:03 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:

 Hi Patrick

 How do I know which part of the code is in the driver and which in task?
 The structure of my code is as below-

 …

 static boolean done = false;
 …

 public static void main(..

 ..

 JavaRDD<String> lines = ..

 ..

 while (!done) {

 ..
 while (..) {

 JavaPairRDD<Integer, List<Integer>> labs1 = labs.map(new PairFunction… );

 !! Here I have System.out.println (A)

 } // inner while

 !! Here I have System.out.println (B)


 if (…) {
 done = true;

 !! Also here some System.out.println (C)

 break;
 }

 else {

 if (…) {

 !! More System.out.println (D)


 labs = labs.map(…);

 }
 }

 } // outer while

 !! Even more System.out.println (E)

 } // main

 } //class

 I get the console outputs on the master for (B) and (E). I do not see any
 stdout in the worker node. I find the stdout and stderr in the
 spark/work/appid/0/. I see output
 in stderr but not in stdout.

 I do get all the outputs on the console when I run it in local mode.

 Sorry I am new and may be asking some naïve question but it is really
 confusing to me. Thanks for your help.

 Ranjan

 On 3/9/14, 10:50 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Sen,
 
 Is your code in the driver code or inside one of the tasks?
 
 If it's in the tasks, the place you would expect these to be is in
 stdout file under spark/appid/work/[stdout/stderr]. Are you seeing
 at least stderr logs in that folder? If not then the tasks might not
  be running on the worker machines. If you see stderr but not stdout
 that's a bit of a puzzler since they both go through the same
 mechanism.
 
 - Patrick
 
 On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com
 wrote:
  Hi
  I have some System.out.println in my Java code that is working ok in a
 local
  environment. But when I run the same code on a standalone  mode in a
 EC2
  cluster I do not see them at the worker stdout (in the worker node
 under
  spark location/work ) or at the driver console. Could you help me
  understand how do I troubleshoot?
 
  Thanks
  Ranjan




  --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com



computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hello all,
i am observing a strange result. i have a computation that i run on a
cached RDD in spark-standalone. it typically takes about 4 seconds.

but when other RDDs that are not relevant to the computation at hand are
cached in memory (in same spark context), the computation takes 40 seconds
or more.

the problem seems to be GC time, which goes from milliseconds to tens of
seconds.

note that my issue is not that memory is full. i have cached about 14G in
RDDs with 66G available across workers for the application. also my
computation did not push any cached RDD out of memory.

any ideas?


Re: computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hey matei,
it happens repeatedly.

we are currently running on java 6 with spark 0.9.

i will add -XX:+PrintGCDetails and collect details, and also look into java
7 G1. thanks
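
for anyone else following along: one way to pass such JVM flags to the
executors on 0.9 standalone is SPARK_JAVA_OPTS in conf/spark-env.sh (a sketch;
the flags shown are just the ones mentioned in this thread):

export SPARK_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC"

then restart the workers so newly launched executors pick it up.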






On Mon, Mar 10, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Does this happen repeatedly if you keep running the computation, or just
 the first time? It may take time to move these Java objects to the old
 generation the first time you run queries, which could lead to a GC pause
 that also slows down the small queries.

 If you can run with -XX:+PrintGCDetails in your Java options, it would
 also be good to see what percent of each GC generation is used.

 The concurrent mark-and-sweep GC -XX:+UseConcMarkSweepGC or the G1 GC in
 Java 7 (-XX:+UseG1GC) might also avoid these pauses by GCing concurrently
 with your application threads.

 Matei

 On Mar 10, 2014, at 3:18 PM, Koert Kuipers ko...@tresata.com wrote:

 hello all,
 i am observing a strange result. i have a computation that i run on a
 cached RDD in spark-standalone. it typically takes about 4 seconds.

 but when other RDDs that are not relevant to the computation at hand are
 cached in memory (in same spark context), the computation takes 40 seconds
 or more.

 the problem seems to be GC time, which goes from milliseconds to tens of
 seconds.

 note that my issue is not that memory is full. i have cached about 14G in
 RDDs with 66G available across workers for the application. also my
 computation did not push any cached RDD out of memory.

 any ideas?





Re: Too many open files exception on reduceByKey

2014-03-10 Thread Patrick Wendell
Hey Matt,

The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.

You might be able to hack around this by decreasing the number of
reducers but this could have some performance implications for your
job.

In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.

This means you'll have to use fewer reducers (e.g. pass reduceByKey a
number of reducers) or use fewer cores on each machine.
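
For example, a sketch against the 0.9-era Java API (the value 64 is arbitrary;
pick what your ulimit can handle):

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class FewerReducers {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "FewerReducers");
    JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a", "c"));

    JavaPairRDD<String, Integer> pairs = words.map(
        new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String w) {
            return new Tuple2<String, Integer>(w, 1);
          }
        });

    // An explicit partition count caps the number of reduce-side outputs,
    // and with it the number of shuffle files each core writes in parallel.
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) { return a + b; }
        }, 64);

    System.out.println(counts.collectAsMap());
  }
}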

- Patrick

On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
matthew.c.ch...@gmail.com wrote:
 Hi everyone,

 My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
 operation on a cluster of 10 slaves where I don't have the privileges to set
 ulimit -n to a higher number. I'm running on a cluster where ulimit -n
 returns 1024 on each machine.

 When I attempt to run this job with the data originating from a text file,
 stored in an HDFS cluster running on the same nodes as the Spark cluster,
 the job crashes with the message, Too many open files.

 My question is, why are so many files being created, and is there a way to
 configure the Spark context to avoid spawning that many files? I am already
 setting spark.shuffle.consolidateFiles to true.

 I want to repeat - I can't change the maximum number of open file
 descriptors on the machines. This cluster is not owned by me and the system
 administrator is responding quite slowly.

 Thanks,

 -Matt Cheah


How to create RDD from Java in-memory data?

2014-03-10 Thread wallacemann
I would like to construct an RDD from data I already have in memory as POJO
objects.  Is this possible?  For example, is it possible to create an RDD
from Iterable<String>?

I'm running Spark from Java as a stand-alone application.  The JavaWordCount
example runs fine.  In the example, the initial RDD is populated from a text
file.  In my use case, I'm streaming data from a database, but even this is
hidden behind an interface which is essentially Iterable<String>.

What I am doing is so basic that I must not understand something obvious. 
Thanks for any suggestions.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-RDD-from-Java-in-memory-data-tp2486.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Unsubscribe

2014-03-10 Thread Shalini Singh
Unsubscribe


if there is shark 0.9 build can be download?

2014-03-10 Thread qingyang li
Does anyone know if there is a Shark 0.9 build that can be downloaded?
If not, when will there be a Shark 0.9 build?


Re: How to create RDD from Java in-memory data?

2014-03-10 Thread wallacemann
I was right ... I was missing something obvious.  The answer to my question
is to use JavaSparkContext.parallelize which works with List<T> or
List<Tuple2<K,V>>.
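
For example, a small sketch (class name and data made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InMemoryRDD {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "InMemoryRDD");

    // parallelize takes a List, so an Iterable<String> first needs to be
    // copied into one (fine as long as it fits in driver memory).
    Iterable<String> source = Arrays.asList("row1", "row2", "row3");
    List<String> rows = new ArrayList<String>();
    for (String r : source) {
      rows.add(r);
    }

    JavaRDD<String> rdd = sc.parallelize(rows);
    System.out.println(rdd.count());
  }
}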



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-RDD-from-Java-in-memory-data-tp2486p2487.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: [BLOG] Spark on Cassandra w/ Calliope

2014-03-10 Thread abhinav chowdary
+1 -- we have been using Calliope for a few months and it's working out
really great for us. Any plans on integrating it into Spark?
On Mar 10, 2014 1:58 PM, Rohit Rai ro...@tuplejump.com wrote:

 We are happy that you found Calliope useful and glad we could help.

 Founder & CEO, Tuplejump, Inc.

 www.tuplejump.com
 The Data Engineering Platform


 On Sat, Mar 8, 2014 at 2:18 AM, Brian O'Neill b...@alumni.brown.edu wrote:


 FWIW - I posted some notes to help people get started quickly with Spark
 on C*.
 http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

 (tnx again to Rohit and team for all of their help)

 -brian

 --
 Brian ONeill
 CTO, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42





Re: Sharing SparkContext

2014-03-10 Thread Mayur Rustagi
Which version of Spark  are you using?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Mon, Mar 10, 2014 at 6:49 PM, abhinav chowdary 
abhinav.chowd...@gmail.com wrote:

 for any one who is interested to know about job server from Ooyala.. we
 started using it recently and been working great so far..
 On Feb 25, 2014 9:23 PM, Ognen Duzlevski og...@nengoiksvelzud.com
 wrote:

  In that case, I must have misunderstood the following (from
 http://spark.incubator.apache.org/docs/0.8.1/job-scheduling.html).
 Apologies. Ognen

 Inside a given Spark application (SparkContext instance), multiple
 parallel jobs can run simultaneously if they were submitted from separate
 threads. By “job”, in this section, we mean a Spark action (e.g. save,
 collect) and any tasks that need to run to evaluate that action. Spark’s
 scheduler is fully thread-safe and supports this use case to enable
 applications that serve multiple requests (e.g. queries for multiple
 users).

 By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is
 divided into “stages” (e.g. map and reduce phases), and the first job gets
 priority on all available resources while its stages have tasks to launch,
 then the second job gets priority, etc. If the jobs at the head of the
 queue don’t need to use the whole cluster, later jobs can start to run
 right away, but if the jobs at the head of the queue are large, then later
 jobs may be delayed significantly.

 Starting in Spark 0.8, it is also possible to configure fair sharing
 between jobs. Under fair sharing, Spark assigns tasks between jobs in a
 “round robin” fashion, so that all jobs get a roughly equal share of
 cluster resources. This means that short jobs submitted while a long job is
 running can start receiving resources right away and still get good
 response times, without waiting for the long job to finish. This mode is
 best for multi-user settings.

 To enable the fair scheduler, simply set the spark.scheduler.mode to FAIR
  before creating a SparkContext:
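
 (Roughly, that amounts to setting the property before the context is created.
 A sketch in the Java API, with class and app names made up; on Spark 0.9 the
 equivalent is conf.set("spark.scheduler.mode", "FAIR") on a SparkConf:)

 import org.apache.spark.api.java.JavaSparkContext;

 public class FairModeExample {
   public static void main(String[] args) {
     // Must be set before the SparkContext is created.
     System.setProperty("spark.scheduler.mode", "FAIR");
     JavaSparkContext sc = new JavaSparkContext("local", "FairModeExample");
     System.out.println("scheduler mode: " + System.getProperty("spark.scheduler.mode"));
   }
 }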
 On 2/25/14, 12:30 PM, Mayur Rustagi wrote:

 fair scheduler merely reorders tasks .. I think he is looking to run
 multiple pieces of code on a single context on demand from customers...if
 the code  order is decided then fair scheduler will ensure that all tasks
 get equal cluster time :)



  Mayur Rustagi
 Ph: +919632149971
 http://www.sigmoidanalytics.com
 https://twitter.com/mayur_rustagi



 On Tue, Feb 25, 2014 at 10:24 AM, Ognen Duzlevski 
 og...@nengoiksvelzud.com wrote:

  Doesn't the fair scheduler solve this?
 Ognen


 On 2/25/14, 12:08 PM, abhinav chowdary wrote:

 Sorry for not being clear earlier
 how do you want to pass the operations to the spark context?
 this is partly what i am looking for . How to access the active spark
 context and possible ways to pass operations

  Thanks



  On Tue, Feb 25, 2014 at 10:02 AM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 how do you want to pass the operations to the spark context?


  Mayur Rustagi
 Ph: +919632149971
 http://www.sigmoidanalytics.com
 https://twitter.com/mayur_rustagi



 On Tue, Feb 25, 2014 at 9:59 AM, abhinav chowdary 
 abhinav.chowd...@gmail.com wrote:

 Hi,
I am looking for ways to share the sparkContext, meaning i need
 to be able to perform multiple operations on the same spark context.

  Below is code of a simple app i am testing

   def main(args: Array[String]) {
 println("Welcome to example application!")

  val sc = new SparkContext("spark://10.128.228.142:7077", "Simple App")

  println("Spark context created!")

  println("Creating RDD!")

  Now once this context is created i want to access  this to submit
 multiple jobs/operations

  Any help is much appreciated

  Thanks







  --
 Warm Regards
 Abhinav Chowdary







Re: Sharing SparkContext

2014-03-10 Thread abhinav chowdary
0.8.1 -- we used branch 0.8 and pulled the request into our local repo. I
remember we had to deal with a few issues, but once we got through those it
has been working great.
On Mar 10, 2014 6:51 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Which version of Spark  are you using?


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Mar 10, 2014 at 6:49 PM, abhinav chowdary 
 abhinav.chowd...@gmail.com wrote:

 for any one who is interested to know about job server from Ooyala.. we
 started using it recently and been working great so far..
 On Feb 25, 2014 9:23 PM, Ognen Duzlevski og...@nengoiksvelzud.com
 wrote:

  In that case, I must have misunderstood the following (from
 http://spark.incubator.apache.org/docs/0.8.1/job-scheduling.html).
 Apologies. Ognen

 Inside a given Spark application (SparkContext instance), multiple
 parallel jobs can run simultaneously if they were submitted from separate
 threads. By job, in this section, we mean a Spark action (e.g. save,
 collect) and any tasks that need to run to evaluate that action.
 Spark's scheduler is fully thread-safe and supports this use case to enable
 applications that serve multiple requests (e.g. queries for multiple
 users).

 By default, Spark's scheduler runs jobs in FIFO fashion. Each job is
 divided into stages (e.g. map and reduce phases), and the first job gets
 priority on all available resources while its stages have tasks to launch,
 then the second job gets priority, etc. If the jobs at the head of the
 queue don't need to use the whole cluster, later jobs can start to run
 right away, but if the jobs at the head of the queue are large, then later
 jobs may be delayed significantly.

 Starting in Spark 0.8, it is also possible to configure fair sharing
 between jobs. Under fair sharing, Spark assigns tasks between jobs in a
 round robin fashion, so that all jobs get a roughly equal share of
 cluster resources. This means that short jobs submitted while a long job is
 running can start receiving resources right away and still get good
 response times, without waiting for the long job to finish. This mode is
 best for multi-user settings.

 To enable the fair scheduler, simply set the spark.scheduler.mode to
 FAIR before creating a SparkContext:
 On 2/25/14, 12:30 PM, Mayur Rustagi wrote:

 fair scheduler merely reorders tasks .. I think he is looking to run
 multiple pieces of code on a single context on demand from customers...if
 the code  order is decided then fair scheduler will ensure that all tasks
 get equal cluster time :)



  Mayur Rustagi
 Ph: +919632149971
 http://www.sigmoidanalytics.com
 https://twitter.com/mayur_rustagi



 On Tue, Feb 25, 2014 at 10:24 AM, Ognen Duzlevski 
 og...@nengoiksvelzud.com wrote:

  Doesn't the fair scheduler solve this?
 Ognen


 On 2/25/14, 12:08 PM, abhinav chowdary wrote:

 Sorry for not being clear earlier
 how do you want to pass the operations to the spark context?
 this is partly what i am looking for . How to access the active spark
 context and possible ways to pass operations

  Thanks



  On Tue, Feb 25, 2014 at 10:02 AM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 how do you want to pass the operations to the spark context?


  Mayur Rustagi
 Ph: +919632149971
 http://www.sigmoidanalytics.com
 https://twitter.com/mayur_rustagi



 On Tue, Feb 25, 2014 at 9:59 AM, abhinav chowdary 
 abhinav.chowd...@gmail.com wrote:

 Hi,
I am looking for ways to share the sparkContext, meaning i
 need to be able to perform multiple operations on the same spark context.

  Below is code of a simple app i am testing

   def main(args: Array[String]) {
 println("Welcome to example application!")

  val sc = new SparkContext("spark://10.128.228.142:7077", "Simple App")

  println("Spark context created!")

  println("Creating RDD!")

  Now once this context is created i want to access  this to submit
 multiple jobs/operations

  Any help is much appreciated

  Thanks







  --
 Warm Regards
 Abhinav Chowdary








Re: Sharing SparkContext

2014-03-10 Thread Ognen Duzlevski

Are you using it with HDFS? What version of Hadoop? 1.0.4?
Ognen

On 3/10/14, 8:49 PM, abhinav chowdary wrote:


for any one who is interested to know about job server from Ooyala.. 
we started using it recently and been working great so far..


On Feb 25, 2014 9:23 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote:


In that case, I must have misunderstood the following (from
http://spark.incubator.apache.org/docs/0.8.1/job-scheduling.html).
Apologies. Ognen

Inside a given Spark application (SparkContext instance),
multiple parallel jobs can run simultaneously if they were
submitted from separate threads. By job, in this section, we
mean a Spark action (e.g. save, collect) and any tasks that need
to run to evaluate that action. Spark's scheduler is fully
thread-safe and supports this use case to enable applications that
serve multiple requests (e.g. queries for multiple users).

By default, Spark's scheduler runs jobs in FIFO fashion. Each job
is divided into stages (e.g. map and reduce phases), and the
first job gets priority on all available resources while its
stages have tasks to launch, then the second job gets priority,
etc. If the jobs at the head of the queue don't need to use the
whole cluster, later jobs can start to run right away, but if the
jobs at the head of the queue are large, then later jobs may be
delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair
sharing between jobs. Under fair sharing, Spark assigns tasks
between jobs in a round robin fashion, so that all jobs get a
roughly equal share of cluster resources. This means that short
jobs submitted while a long job is running can start receiving
resources right away and still get good response times, without
waiting for the long job to finish. This mode is best for
multi-user settings.

To enable the fair scheduler, simply set
the spark.scheduler.mode to FAIR before creating a SparkContext:

On 2/25/14, 12:30 PM, Mayur Rustagi wrote:

fair scheduler merely reorders tasks .. I think he is looking to
run multiple pieces of code on a single context on demand from
customers...if the code  order is decided then fair scheduler
will ensure that all tasks get equal cluster time :)



Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Tue, Feb 25, 2014 at 10:24 AM, Ognen Duzlevski og...@nengoiksvelzud.com wrote:

Doesn't the fair scheduler solve this?
Ognen


On 2/25/14, 12:08 PM, abhinav chowdary wrote:

Sorry for not being clear earlier
how do you want to pass the operations to the spark context?
this is partly what i am looking for . How to access the
active spark context and possible ways to pass operations

Thanks



On Tue, Feb 25, 2014 at 10:02 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

how do you want to pass the operations to the spark context?


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Tue, Feb 25, 2014 at 9:59 AM, abhinav chowdary abhinav.chowd...@gmail.com wrote:

Hi,
   I am looking for ways to share the
sparkContext, meaning i need to be able to perform
multiple operations on the same spark context.

Below is code of a simple app i am testing

 def main(args: Array[String]) {
println("Welcome to example application!")

val sc = new SparkContext("spark://10.128.228.142:7077", "Simple App")

println("Spark context created!")

println("Creating RDD!")

Now once this context is created i want to access
 this to submit multiple jobs/operations

Any help is much appreciated

Thanks







-- 
Warm Regards

Abhinav Chowdary







--
Some people, when confronted with a problem, think I know, I'll use regular 
expressions. Now they have two problems.
-- Jamie Zawinski



is spark 0.9.0 HA?

2014-03-10 Thread qingyang li
Is Spark 0.9.0 HA? We only have one master server, so I think it is not.
Does anyone know how to support HA for Spark?


Re: Sharing SparkContext

2014-03-10 Thread abhinav chowdary
hdfs 1.0.4 but we primarily use Cassandra + Spark (calliope). I tested it
with both
 Are you using it with HDFS? What version of Hadoop? 1.0.4?
Ognen

On 3/10/14, 8:49 PM, abhinav chowdary wrote:

for any one who is interested to know about job server from Ooyala.. we
started using it recently and been working great so far..
On Feb 25, 2014 9:23 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote:

  In that case, I must have misunderstood the following (from
 http://spark.incubator.apache.org/docs/0.8.1/job-scheduling.html).
 Apologies. Ognen

 Inside a given Spark application (SparkContext instance), multiple
 parallel jobs can run simultaneously if they were submitted from separate
 threads. By job, in this section, we mean a Spark action (e.g. save,
 collect) and any tasks that need to run to evaluate that action. Spark's
 scheduler is fully thread-safe and supports this use case to enable
 applications that serve multiple requests (e.g. queries for multiple
 users).

 By default, Spark's scheduler runs jobs in FIFO fashion. Each job is
 divided into stages (e.g. map and reduce phases), and the first job gets
 priority on all available resources while its stages have tasks to launch,
 then the second job gets priority, etc. If the jobs at the head of the
 queue don't need to use the whole cluster, later jobs can start to run
 right away, but if the jobs at the head of the queue are large, then later
 jobs may be delayed significantly.

 Starting in Spark 0.8, it is also possible to configure fair sharing
 between jobs. Under fair sharing, Spark assigns tasks between jobs in a
 round robin fashion, so that all jobs get a roughly equal share of
 cluster resources. This means that short jobs submitted while a long job is
 running can start receiving resources right away and still get good
 response times, without waiting for the long job to finish. This mode is
 best for multi-user settings.

 To enable the fair scheduler, simply set the spark.scheduler.mode to FAIR 
 before
 creating a SparkContext:
 On 2/25/14, 12:30 PM, Mayur Rustagi wrote:

 fair scheduler merely reorders tasks .. I think he is looking to run
 multiple pieces of code on a single context on demand from customers...if
 the code  order is decided then fair scheduler will ensure that all tasks
 get equal cluster time :)



  Mayur Rustagi
 Ph: +919632149971
 http://www.sigmoidanalytics.com
 https://twitter.com/mayur_rustagi



 On Tue, Feb 25, 2014 at 10:24 AM, Ognen Duzlevski 
 og...@nengoiksvelzud.com wrote:

  Doesn't the fair scheduler solve this?
 Ognen


 On 2/25/14, 12:08 PM, abhinav chowdary wrote:

 Sorry for not being clear earlier
 how do you want to pass the operations to the spark context?
 this is partly what i am looking for . How to access the active spark
 context and possible ways to pass operations

  Thanks



  On Tue, Feb 25, 2014 at 10:02 AM, Mayur Rustagi mayur.rust...@gmail.com
  wrote:

 how do you want to pass the operations to the spark context?


  Mayur Rustagi
 Ph: +919632149971
 http://www.sigmoidanalytics.com
 https://twitter.com/mayur_rustagi



 On Tue, Feb 25, 2014 at 9:59 AM, abhinav chowdary 
 abhinav.chowd...@gmail.com wrote:

 Hi,
I am looking for ways to share the sparkContext, meaning i need
 to be able to perform multiple operations on the same spark context.

  Below is code of a simple app i am testing

   def main(args: Array[String]) {
 println("Welcome to example application!")

  val sc = new SparkContext("spark://10.128.228.142:7077", "Simple App")

  println("Spark context created!")

  println("Creating RDD!")

  Now once this context is created i want to access  this to submit
 multiple jobs/operations

  Any help is much appreciated

  Thanks







  --
 Warm Regards
 Abhinav Chowdary





-- 
Some people, when confronted with a problem, think I know, I'll use
regular expressions. Now they have two problems.
-- Jamie Zawinski


Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-10 Thread hequn cheng
Have you sent spark-env.sh to the slave nodes?


2014-03-11 6:47 GMT+08:00 Linlin linlin200...@gmail.com:


 Hi,

 I have a java option (-Xss) setting specified in SPARK_JAVA_OPTS in
 spark-env.sh,  noticed after stop/restart the spark cluster, the
 master/worker daemon has the setting being applied, but this setting is not
 being propagated to the executor, and my application continues to behave the same.
 I
 am not sure if there is a way to specify it through SparkConf? like
 SparkConf.set(), and what is the correct way of setting this up for a
 particular spark application.

 Thank you!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OPTS-not-picked-up-by-the-application-tp2483.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: is spark 0.9.0 HA?

2014-03-10 Thread Aaron Davidson
Spark 0.9.0 does include standalone scheduler HA, but it requires running
multiple masters. The docs are located here:
https://spark.apache.org/docs/0.9.0/spark-standalone.html#high-availability

0.9.0 also includes driver HA (for long-running normal or streaming jobs),
allowing you to submit a driver into the standalone cluster which will be
restarted automatically if it crashes. That doc is on the same page:
https://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster

Please let me know if you have further questions.
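
For reference, a sketch of the relevant standalone HA settings from that page
(host names are placeholders):

# conf/spark-env.sh on each master
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

Applications and workers then use a multi-master URL such as
spark://master1:7077,master2:7077.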


On Mon, Mar 10, 2014 at 6:57 PM, qingyang li liqingyang1...@gmail.com wrote:

 Is Spark 0.9.0 HA? We only have one master server, so I think it is not.
 Does anyone know how to support HA for Spark?



Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-10 Thread Robin Cjc
The properties in spark-env.sh are machine-specific, so you need to specify
them on your worker as well. I guess what you are asking about is
System.setProperty(); you can call it before you initialize your SparkContext.

Best Regards,
Chen Jingci


On Tue, Mar 11, 2014 at 6:47 AM, Linlin linlin200...@gmail.com wrote:


 Hi,

 I have a java option (-Xss) setting specified in SPARK_JAVA_OPTS in
 spark-env.sh,  noticed after stop/restart the spark cluster, the
 master/worker daemon has the setting being applied, but this setting is not
 being propagated to the executor, and my application continues to behave the same.
 I
 am not sure if there is a way to specify it through SparkConf? like
 SparkConf.set(), and what is the correct way of setting this up for a
 particular spark application.

 Thank you!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OPTS-not-picked-up-by-the-application-tp2483.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-10 Thread Linlin
my cluster only has 1 node (master/worker). 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OPTS-not-picked-up-by-the-application-tp2483p2506.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: how to use the log4j for the standalone app

2014-03-10 Thread lihu
Thanks, but I do not want to log my own program's info. I just do not want
Spark to output all the info to my console; I want Spark to output the log
into some file which I specify.
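
A sketch of a conf/log4j.properties that does that (the file path is only an
example):

log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/home/hadoop/spark/logs/app.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

As far as I know it has to reach log4j either as a JVM system property
(-Dlog4j.configuration=file:/home/hadoop/spark/conf/log4j.properties) or as a
log4j.properties file on the classpath; setting it through SparkConf.set()
will not reach log4j.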



On Tue, Mar 11, 2014 at 11:49 AM, Robin Cjc cjcro...@gmail.com wrote:

 Hi lihu,

 you can extend the org.apache.spark.Logging trait, then use functions
 like logInfo(). It will then log according to the config in your
 log4j.properties.

 Best Regards,
 Chen Jingci


 On Tue, Mar 11, 2014 at 11:36 AM, lihu lihu...@gmail.com wrote:

 Hi,
   I use Spark 0.9, and when I run the spark-shell, I can log properly
 according to the log4j.properties in the SPARK_HOME/conf directory. But when I
 use a standalone app, I do not know how to set up logging.

   I use the SparkConf to set it, such as:

   val conf = new SparkConf()
   conf.set("log4j.configuration", "/home/hadoop/spark/conf/log4j.properties")

   but it does not work.

   this question may be simple, but I cannot find anything on the web, and
 I think this may be helpful for many people who are not familiar with Spark.





-- 
*Best Wishes!*

*Li Hu(李浒) | Graduate Student*

*Institute for Interdisciplinary Information Sciences(IIIS
http://iiis.tsinghua.edu.cn/)*
*Tsinghua University, China*

*Email: lihu...@gmail.com lihu...@gmail.com*
*Tel  : +86 15120081920*
*Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
http://iiis.tsinghua.edu.cn/zh/lihu/*


Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-10 Thread Linlin
Thanks! 

since my worker is on the same node, and the -Xss JVM option sets the maximum
thread stack size, my worker does show this option now. Now I realize I
accidentally ran the app in local mode, as I didn't give the master URL
when initializing the Spark context. For local mode, how do I pass a JVM
option to the app?


hadoop   17315 1  0 14:56 ?00:02:12
/home/hadoop/ibm-java-x86_64-60/bin/java -cp
:/home/hadoop/spark-0.9.0-incubating/conf:/home/hadoop/spark-0.9.0-incubating/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop1.2.1.jar
-Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
-Xss1024k org.apache.spark.deploy.worker.Worker
spark://hdtest021.svl.ibm.com:7077
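
In local mode the tasks run inside the application's own JVM, so an option
like -Xss has to go on the java command that launches the app itself (a rough
sketch, with made-up jar and class names):

java -Xss1024k -cp myapp.jar:/home/hadoop/spark-0.9.0-incubating/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop1.2.1.jar com.example.MyApp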





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OPTS-not-picked-up-by-the-application-tp2483p2510.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-10 Thread Linlin

Thanks! 

so SPARK_DAEMON_JAVA_OPTS is for worker? and SPARK_JAVA_OPTS is for master?  
I only set SPARK_JAVA_OPTS in spark-env.sh, and the JVM opt is applied to
both master/worker daemon. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OPTS-not-picked-up-by-the-application-tp2483p2511.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.