Re: run spark0.9.1 on yarn with hadoop CDH4

2014-05-15 Thread Arpit Tak
Also try this out; we have already done this setup, and it should help you:
http://docs.sigmoidanalytics.com/index.php/Setup_hadoop_2.0.0-cdh4.2.0_and_spark_0.9.0_on_ubuntu_12.04




On Tue, May 6, 2014 at 10:17 PM, Andrew Lee  wrote:

> Please check JAVA_HOME. Usually it should point to /usr/java/default on
> CentOS/Linux.
>
> or FYI: http://stackoverflow.com/questions/1117398/java-home-directory
>
>
> > Date: Tue, 6 May 2014 00:23:02 -0700
> > From: sln-1...@163.com
> > To: u...@spark.incubator.apache.org
> > Subject: run spark0.9.1 on yarn with hadoop CDH4
>
> >
> > Hi all,
> > I have made HADOOP_CONF_DIR or YARN_CONF_DIR point to the directory which
> > contains the (client-side) configuration files for the hadoop cluster.
> > The command I run to launch the YARN Client is:
> >
> > #
> > SPARK_JAR=./~/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
> >   ./bin/spark-class org.apache.spark.deploy.yarn.Client \
> >   --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
> >   --class org.apache.spark.examples.SparkPi \
> >   --args yarn-standalone \
> >   --num-workers 3 \
> >   --master-memory 2g \
> >   --worker-memory 2g \
> >   --worker-cores 1
> > ./bin/spark-class: line 152: /usr/lib/jvm/java-7-sun/bin/java: No such file or directory
> > ./bin/spark-class: line 152: exec: /usr/lib/jvm/java-7-sun/bin/java: cannot execute: No such file or directory
> > How can I make this run properly?
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/run-spark0-9-1-on-yarn-with-hadoop-CDH4-tp5426.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Is there anything that I need to modify?

2014-05-11 Thread Arpit Tak
Try adding a hostname-to-IP mapping in /etc/hosts; it is not able to resolve the hostname to an IP.
Try this:
192.168.10.220   CHBM220




On Wed, May 7, 2014 at 12:50 PM, Sophia  wrote:

> [root@CHBM220 spark-0.9.1]#
>
> SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
> ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
> examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class
> org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3
> --master-memory 2g --worker-memory 2g --worker-cores 1
> 14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at
> CHBM220/192.168.10.220:8032
> Exception in thread "main" java.io.IOException: Failed on local exception:
> com.google.protobuf.InvalidProtocolBufferException: Protocol message
> contained an invalid tag (zero).; Host Details : local host is:
> "CHBM220/192.168.10.220"; destination host is: "CHBM220":8032;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1351)
> at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
> at
>
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
> at
>
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
> at
>
> org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
> at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
> message contained an invalid tag (zero).
> at
>
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
> at
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
> at
>
> org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.(RpcHeaderProtos.java:1398)
> at
>
> org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.(RpcHeaderProtos.java:1362)
> at
>
> org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
> at
>
> org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
> at
>
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
> at
>
> com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
> at
>
> com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
> at
>
> com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
> at
>
> com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
> at
>
> org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
> at
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)
> [root@CHBM220:spark-0.9.1]#
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anything-that-I-need-to-modify-tp5477.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: error in mllib lr example code

2014-04-24 Thread Arpit Tak
Also try out these examples; all of them work:

http://docs.sigmoidanalytics.com/index.php/MLlib

if you spot any problems in those, let us know.
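
For reference, a minimal sketch of the parsing fix Xiangrui describes below
(this assumes a snapshot build where LabeledPoint takes an
org.apache.spark.mllib.linalg.Vector):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  // dense feature vector instead of a plain Array[Double]
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}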

Regards,
arpit


On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia wrote:

> See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more
> recent build of the docs; if you spot any problems in those, let us know.
>
> Matei
>
> On Apr 23, 2014, at 9:49 AM, Xiangrui Meng  wrote:
>
> > The doc is for 0.9.1. You are running a later snapshot, which added
> > sparse vectors. Try LabeledPoint(parts(0).toDouble,
> > Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples
> > are updated in the master branch. You can also check the examples
> > there. -Xiangrui
> >
> > On Wed, Apr 23, 2014 at 9:34 AM, Mohit Jaggi 
> wrote:
> >>
> >> sorry...added a subject now
> >>
> >> On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi 
> wrote:
> >>>
> >>> I am trying to run the example linear regression code from
> >>>
> >>> http://spark.apache.org/docs/latest/mllib-guide.html
> >>>
> >>> But I am getting the following error...am I missing an import?
> >>>
> >>> code
> >>>
> >>> import org.apache.spark._
> >>>
> >>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD
> >>>
> >>> import org.apache.spark.mllib.regression.LabeledPoint
> >>>
> >>>
> >>> object ModelLR {
> >>>
> >>>  def main(args: Array[String]) {
> >>>
> >>>val sc = new SparkContext(args(0), "SparkLR",
> >>>
> >>>  System.getenv("SPARK_HOME"),
> >>> SparkContext.jarOfClass(this.getClass).toSeq)
> >>>
> >>> // Load and parse the data
> >>>
> >>> val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
> >>>
> >>> val parsedData = data.map { line =>
> >>>
> >>>  val parts = line.split(',')
> >>>
> >>>  LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x =>
> >>> x.toDouble).toArray)
> >>>
> >>> }
> >>>
> >>> ..
> >>>
> >>> }
> >>>
> >>> error
> >>>
> >>> - polymorphic expression cannot be instantiated to expected type;
> found :
> >>> [U >: Double]Array[U] required:
> >>>
> >>> org.apache.spark.mllib.linalg.Vector
> >>>
> >>> - polymorphic expression cannot be instantiated to expected type;
> found :
> >>> [U >: Double]Array[U] required:
> >>>
> >>> org.apache.spark.mllib.linalg.Vector
> >>
> >>
>
>


Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Arpit Tak
OK, fine.

Try it like this; I tried it and it works.
Specify the Spark home path in the constructor as well, and also export:
export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g"

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/var/log/auth.log" // read any file
    val sc = new SparkContext("spark://localhost:7077", "Simple App",
      "/home/ubuntu/spark-0.9.1-incubating/",                 // Spark home path
      List("target/scala-2.10/simple-project_2.10-2.0.jar"))  // application jar
    val tr = sc.textFile(logFile).cache()
    tr.take(100).foreach(println)
  }
}

This will work
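
If the goal is also to control spark.executor.memory itself, in Spark 0.9 it can
be passed as a system property before the SparkContext is created, for example
like this in the same main method (the value below is only an example):

import org.apache.spark.SparkContext

// must be set before the SparkContext is created; example value
System.setProperty("spark.executor.memory", "512m")
val sc = new SparkContext("spark://localhost:7077", "Simple App",
  "/home/ubuntu/spark-0.9.1-incubating/",
  List("target/scala-2.10/simple-project_2.10-2.0.jar"))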


On Thu, Apr 24, 2014 at 3:00 PM, wxhsdp  wrote:

> hi arpit,
> on spark shell, i can read local file properly,
> but when i use sbt run, error occurs.
> the sbt error message is in the beginning of the thread
>
>
> Arpit Tak-2 wrote
> > Hi,
> >
> > You should be able to read it, file://or file:/// not even required for
> > reading locally , just path is enough..
> > what error message you getting on spark-shell while reading...
> > for local:
> >
> >
> > Also read the same from hdfs file also ...
> > put your README file there and read , it  works in both ways..
> > val a= sc.textFile("hdfs://localhost:54310/t/README.md")
> >
> > also, print stack message of your spark-shell...
> >
> >
> > On Thu, Apr 24, 2014 at 2:25 PM, wxhsdp <
>
> > wxhsdp@
>
> > > wrote:
> >
> >> thanks for your reply, adnan, i tried
> >> val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
> >> i think there needs three left slash behind file:
> >>
> >> it's just the same as val logFile =
> >> "home/wxhsdp/spark/example/standalone/README.md"
> >> the error remains:(
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4743.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4745.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Arpit Tak
Hi,

You should be able to read it; the file:// or file:/// prefix is not even
required for reading locally, just the plain path is enough.
What error message are you getting in spark-shell while reading?
for local:
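(for example; any local path that exists on the machine works here:)
val a = sc.textFile("/home/ubuntu/README.md")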


Also read the same file from HDFS: put your README file there and read it, it works both ways.
val a = sc.textFile("hdfs://localhost:54310/t/README.md")

Also, please post the stack trace from your spark-shell.


On Thu, Apr 24, 2014 at 2:25 PM, wxhsdp  wrote:

> thanks for your reply, adnan, i tried
> val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
> i think there needs three left slash behind file:
>
> it's just the same as val logFile =
> "home/wxhsdp/spark/example/standalone/README.md"
> the error remains:(
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4743.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Java heap space and spark.akka.frameSize

2014-04-21 Thread Arpit Tak
Also check out this post
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html
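
If the OutOfMemoryError is thrown on the executors rather than in the driver, it
may also help to raise the executor heap along with the frame size; a minimal
sketch for Spark 0.9 (the master URL and values below are only examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://localhost:7077")       // example master URL
  .setAppName("SparkLR")
  .set("spark.executor.memory", "4g")        // executor heap, example value
  .set("spark.akka.frameSize", "100")
  .set("spark.reducer.maxMbInFlight", "100")
val sc = new SparkContext(conf)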


On Mon, Apr 21, 2014 at 11:49 AM, Akhil Das  wrote:

> Hi Chieh,
>
> You can increase the heap size by exporting the java options (See below,
> will increase the heap size to 10Gb)
>
> export _JAVA_OPTIONS="-Xmx10g"
>
>
>
>
> On Mon, Apr 21, 2014 at 11:43 AM, Chieh-Yen wrote:
>
>> Can anybody help me?
>> Thanks.
>>
>> Chieh-Yen
>>
>>
>> On Wed, Apr 16, 2014 at 5:18 PM, Chieh-Yen wrote:
>>
>>> Dear all,
>>>
>>> I developed a application that the message size of communication
>>> is greater than 10 MB sometimes.
>>> For smaller datasets it works fine, but fails for larger datasets.
>>> Please check the error message following.
>>>
>>> I surveyed the situation online and lots of people said
>>> the problem can be solved by modifying the property
>>> of spark.akka.frameSize
>>> and spark.reducer.maxMbInFlight.
>>> It may look like:
>>>
>>> 134 val conf = new SparkConf()
>>> 135 .setMaster(master)
>>> 136 .setAppName("SparkLR")
>>> 137
>>> .setSparkHome("/home/user/spark-0.9.0-incubating-bin-hadoop2")
>>> 138 .setJars(List(jarPath))
>>> 139 .set("spark.akka.frameSize", "100")
>>> 140 .set("spark.reducer.maxMbInFlight", "100")
>>> 141 val sc = new SparkContext(conf)
>>>
>>> However, the task still fails with the same error message.
>>> The communication message is the weight vectors of each sub-problem,
>>> it may be larger than 10 MB for higher dimensional dataset.
>>>
>>> Is there anybody can help me?
>>> Thanks a lot.
>>>
>>> 
>>> [error] (run-main) org.apache.spark.SparkException: Job aborted:
>>> Exception while deserializing and fetching task:*java.lang.OutOfMemoryError:
>>> Java heap space*
>>> org.apache.spark.SparkException: Job aborted: Exception while
>>> deserializing and fetching task: java.lang.OutOfMemoryError: Java heap space
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>>>  at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>>>  at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>>>  at scala.Option.foreach(Option.scala:236)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
>>>  at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>> at
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>  at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> [trace] Stack trace suppressed: run last compile:run for the full output.
>>> 
>>>
>>> Chieh-Yen
>>>
>>
>>
>
>
> --
> Thanks
> Best Regards
>


Re: Task splitting among workers

2014-04-21 Thread Arpit Tak
1.) What happens if the data is in S3 and we cache it in memory, instead of HDFS?
2.) How is the number of reducers determined in each case?

Even if I specify set mapred.reduce.tasks=50, somehow only 2 reducers are
allocated instead of 50, although the query/tasks do complete.
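
(For plain Spark jobs, as opposed to Shark/Hive queries, the reduce-side
parallelism is usually set explicitly; a small sketch against the 0.9 API in the
spark-shell, where the path and numbers are only examples:)

val lines = sc.textFile("hdfs://localhost:54310/t/data.txt", 50)  // request at least 50 input splits
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _, 50)   // 50 reduce partitions for this shuffle

When the second argument is omitted, Spark falls back to spark.default.parallelism
(if set) or otherwise to the number of upstream partitions.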

Regards,
Arpit





On Mon, Apr 21, 2014 at 9:33 AM, Patrick Wendell  wrote:

> For a HadoopRDD, first the spark scheduler calculates the number of tasks
> based on input splits. Usually people use this with HDFS data so in that
> case it's based on HDFS blocks. If the HDFS datanodes are co-located with
> the Spark cluster then it will try to run the tasks on the data node that
> contains its input to achieve higher throughput. Otherwise, all of the
> nodes are considered equally fit to run any task, and Spark just load
> balances across them.
>
>
> On Sat, Apr 19, 2014 at 9:25 PM, David Thomas  wrote:
>
>> During a Spark stage, how are tasks split among the workers? Specifically
>> for a HadoopRDD, who determines which worker has to get which task?
>>
>
>


Re: Having spark-ec2 join new slaves to existing cluster

2014-04-18 Thread Arpit Tak
Hi all,

If the cluster is already running and I want to add slaves to the existing cluster,
which is the best way of doing it:
1.) As Matei said, select a slave and "launch more of these"
2.) Create an AMI of a slave and launch more instances from it

The plus point of the first is that it is faster, but I have to rsync everything,
including the ganglia services, passwordless login, etc. (although a simple
script will take care of this), which is what I generally do.

With an AMI, it takes care of everything; I just have to add the slaves to the
conf files.

What I think: if you need to add more than about 15 slaves, go for an AMI instead
of rsync.

What are your suggestions here?

Regards,
Arpit


On Sun, Apr 6, 2014 at 10:26 PM, Rafal Kwasny  wrote:

> Hi,
> This will work nicely unless you're using spot instances, in this case the
> "start" does not work as slaves are lost on shutdown.
> I feel like spark-ec2 script need a major refactor to cope with new
> features/more users using it in dynamic environments.
> Are there any current plans to migrate it to CDH5 (just released) based
> install?
>
> /Raf
>
>
> Nicholas Chammas wrote:
>
> Sweet, thanks for the instructions. This will do for resizing a dev
> cluster that you can bring down at will.
>
> I will open a JIRA issue about adding the functionality I described to
> spark-ec2.
>
>
> On Fri, Apr 4, 2014 at 3:43 PM, Matei Zaharia wrote:
>
>> This can’t be done through the script right now, but you can do it
>> manually as long as the cluster is stopped. If the cluster is stopped, just
>> go into the AWS Console, right click a slave and choose “launch more of
>> these” to add more. Or select multiple slaves and delete them. When you run
>> spark-ec2 start the next time to start your cluster, it will set it up on
>> all the machines it finds in the mycluster-slaves security group.
>>
>> This is pretty hacky so it would definitely be good to add this feature;
>> feel free to open a JIRA about it.
>>
>> Matei
>>
>> On Apr 4, 2014, at 12:16 PM, Nicholas Chammas 
>> wrote:
>>
>> I would like to be able to use spark-ec2 to launch new slaves and add
>> them to an existing, running cluster. Similarly, I would also like to
>> remove slaves from an existing cluster.
>>
>> Use cases include:
>>
>>1. Oh snap, I sized my cluster incorrectly. Let me add/remove some
>>slaves.
>>2. During scheduled batch processing, I want to add some new slaves,
>>perhaps on spot instances. When that processing is done, I want to kill
>>them. (Cruel, I know.)
>>
>> I gather this is not possible at the moment. spark-ec2 appears to be able
>> to launch new slaves for an existing cluster only if the master is stopped.
>> I also do not see any ability to remove slaves from a cluster.
>>
>> Is that correct? Are there plans to add such functionality to spark-ec2
>> in the future?
>>
>> Nick
>>
>>
>> --
>> View this message in context: Having spark-ec2 join new slaves to existing cluster
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>
>


Re: AmpCamp exercise in a local environment

2014-04-18 Thread Arpit Tak
Download Cloudera VM from here.

https://drive.google.com/file/d/0B7zn-Mmft-XcdTZPLXltUjJyeUE/edit?usp=sharing

Regards,
Arpit Tak


On Fri, Apr 18, 2014 at 1:20 PM, Arpit Tak wrote:

> HI Nabeel,
>
> I have a cloudera VM , It has both spark and shark installed in it.
> You can download and play around with it . i also have some sample data in
> hdfs and some table .
>
> You can try out those examples. How to use it ..(instructions are in
> docs...).
>
>
> https://drive.google.com/file/d/0B0Q4Le4DZj5iSndIcFBfQlcxM1NlV3RNN3YzU1dOT1ZjZHJJ/edit?usp=sharing
>
> But for AmpCamp-exercises , you need ec2 only to get wikidata on your
> hdfs. For that I have uploaded file(50Mb) . Just download it and put on
> hdfs .. and you can work around these exercises...
>
>
> https://drive.google.com/a/mobipulse.in/uc?id=0B0Q4Le4DZj5iNUdSZXpFTUJEU0E&export=download
>
> You will love it...
>
> Regards,
> Arpit Tak
>
>
> On Tue, Apr 15, 2014 at 4:28 AM, Nabeel Memon  wrote:
>
>> Hi. I found AmpCamp exercises as a nice way to get started with spark.
>> However they require amazon ec2 access. Has anyone put together any VM or
>> docker scripts to have the same environment locally to work out those labs?
>>
>> It'll be really helpful. Thanks.
>>
>
>


Re: AmpCamp exercise in a local environment

2014-04-18 Thread Arpit Tak
HI Nabeel,

I have a Cloudera VM that has both Spark and Shark installed in it. You
can download it and play around with it. I also have some sample data in HDFS
and a few tables.

You can try out those examples; instructions on how to use it are in the
docs.

https://drive.google.com/file/d/0B0Q4Le4DZj5iSndIcFBfQlcxM1NlV3RNN3YzU1dOT1ZjZHJJ/edit?usp=sharing

But for the AmpCamp exercises, you only need EC2 to get the wiki data onto your HDFS.
For that I have uploaded a file (50 MB). Just download it, put it on HDFS,
and you can work through the exercises:

https://drive.google.com/a/mobipulse.in/uc?id=0B0Q4Le4DZj5iNUdSZXpFTUJEU0E&export=download

You will love it...

Regards,
Arpit Tak


On Tue, Apr 15, 2014 at 4:28 AM, Nabeel Memon  wrote:

> Hi. I found AmpCamp exercises as a nice way to get started with spark.
> However they require amazon ec2 access. Has anyone put together any VM or
> docker scripts to have the same environment locally to work out those labs?
>
> It'll be really helpful. Thanks.
>


Re: Shark: ClassNotFoundException org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

2014-04-17 Thread Arpit Tak
Just out of curiosity, since you are using Cloudera Manager Hadoop and Spark:
how did you build Shark for it?

Also, are you able to read any file from HDFS? Did you try that out?


Regards,
Arpit Tak


On Thu, Apr 17, 2014 at 7:07 PM, ge ko  wrote:

> Hi,
>
> the error java.lang.ClassNotFoundException:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat has been
> resolved by adding
> parquet-hive-bundle-1.4.1.jar to shark's lib folder.
> Now the Hive metastore can be read successfully (also the parquet based
> table).
>
> But if I want to select from that table I receive:
>
> org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 4 times
> (most recent failure: Exception failure: java.lang.ClassNotFoundException:
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>
> This is really strange, since the class
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe is included in
> the parquet-hive-bundle-1.4.1.jar ?!?!
> ...getting more and more confused ;)
>
> any help ?
>
> regards, Gerd
>
>
> On 17 April 2014 11:55, ge ko  wrote:
>
>> Hi,
>>
>> I want to select from a parquet based table in shark, but receive the
>> error:
>>
>> shark> select * from wl_parquet;
>> 14/04/17 11:33:49 INFO shark.SharkCliDriver: Execution Mode: shark
>> 14/04/17 11:33:49 INFO ql.Driver: 
>> 14/04/17 11:33:49 INFO ql.Driver: 
>> 14/04/17 11:33:49 INFO ql.Driver: 
>> 14/04/17 11:33:49 INFO parse.ParseDriver: Parsing command: select * from
>> wl_parquet
>> 14/04/17 11:33:49 INFO parse.ParseDriver: Parse Completed
>> 14/04/17 11:33:49 INFO parse.SharkSemanticAnalyzer: Get metadata for
>> source tables
>> FAILED: Hive Internal Error:
>> java.lang.RuntimeException(java.lang.ClassNotFoundException:
>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat)
>> 14/04/17 11:33:50 ERROR shark.SharkDriver: FAILED: Hive Internal Error:
>> java.lang.RuntimeException(java.lang.ClassNotFoundException:
>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat)
>> java.lang.RuntimeException: java.lang.ClassNotFoundException:
>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>> at
>> org.apache.hadoop.hive.ql.metadata.Table.getInputFormatClass(Table.java:306)
>> at org.apache.hadoop.hive.ql.metadata.Table.(Table.java:99)
>> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:988)
>> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:891)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1083)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1059)
>> at
>> shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:137)
>> at
>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:279)
>> at shark.SharkDriver.compile(SharkDriver.scala:215)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
>> at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:338)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
>> at shark.SharkCliDriver$.main(SharkCliDriver.scala:235)
>> at shark.SharkCliDriver.main(SharkCliDriver.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:270)
>> at
>> org.apache.hadoop.hive.ql.metadata.Table.getInputFormatClass(Table.java:302)
>> ... 14 more
>>
>> I can successfully select from that table with Hive and Impala, but shark
>> doesn't work. I am using CDH5 incl. Spark 

Re: Spark on Yarn or Mesos?

2014-04-17 Thread Arpit Tak
Hi Wei,

Take a look at this post...
http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-td2016.html

Regards,
Arpit Tak


On Thu, Apr 17, 2014 at 3:42 PM, Wei Wang  wrote:

> Hi, there
>
> I would like to know is there any differences between Spark on Yarn and
> Spark on Mesos. Is there any comparision between them? What are the
> advantages and disadvantages for each of them. Is there any criterion for
> choosing between Yarn and Mesos?
>
> BTW, we need MPI in our framework, and I saw MPICH2 is included in Mesos.
> Should it be the reason for choosing Mesos?
>
> Thanks a lot!
>
>
> Weida
>


Re: sbt assembly error

2014-04-16 Thread Arpit Tak
It's because the slf4j directory no longer exists there; maybe they are
updating it:
https://oss.sonatype.org/content/repositories/snapshots/org/

Hard luck; try again after some time...

Regards,
Arpit


On Thu, Apr 17, 2014 at 12:33 AM, Yiou Li  wrote:

> Hi all,
>
> I am trying to build spark assembly using sbt and got connection error
> when resolving dependencies:
>
> I tried web browser and wget some of the dependency links in the error and
> also got 404 error too.
>
> This happened to the following branches:
> spark-0.8.1-incubating
> spark-0.9.1
> spark-0.9.1-bin-hadoop2
>
> Can somebody please kindly advise?
>
> Best,
> Leo
>
>
> Launching sbt from sbt/sbt-launch-0.12.4.jar
> [info] Loading project definition from
> /home/leo/workspace_spark/spark-core/spark-0.9.1-bin-hadoop2/project/project
> [info] Loading project definition from
> /home/leo/workspace_spark/spark-core/spark-0.9.1-bin-hadoop2/project
> [info] Set current project to root (in build
> file:/home/leo/workspace_spark/spark-core/spark-0.9.1-bin-hadoop2/)
> [info] Updating
> {file:/home/leo/workspace_spark/spark-core/spark-0.9.1-bin-hadoop2/}core...
> [info] Resolving org.slf4j#slf4j-log4j12;1.7.2 ...
> [error] Server access Error: Connection timed out url=
> https://oss.sonatype.org/content/repositories/snapshots/org/slf4j/slf4j-log4j12/1.7.2/slf4j-log4j12-1.7.2.pom
> [error] Server access Error: Connection timed out url=
> https://oss.sonatype.org/service/local/staging/deploy/maven2/org/slf4j/slf4j-log4j12/1.7.2/slf4j-log4j12-1.7.2.pom
> [info] Resolving commons-daemon#commons-daemon;1.0.10 ...
> [error] Server access Error: Connection timed out url=
> https://oss.sonatype.org/content/repositories/snapshots/commons-daemon/commons-daemon/1.0.10/commons-daemon-1.0.10.pom
> [error] Server access Error: Connection timed out url=
> https://oss.sonatype.org/service/local/staging/deploy/maven2/commons-daemon/commons-daemon/1.0.10/commons-daemon-1.0.10.pom
> [error] Server access Error: Connection timed out url=
> https://repository.cloudera.com/artifactory/cloudera-repos/commons-daemon/commons-daemon/1.0.10/commons-daemon-1.0.10.pom
> [info] Resolving org.apache.commons#commons-parent;23 ...
> [error] Server access Error: Connection timed out url=
> https://oss.sonatype.org/content/repositories/snapshots/org/apache/commons/commons-parent/23/commons-parent-23.pom
>
> (truncated)
>


Re: Spark packaging

2014-04-16 Thread Arpit Tak
Also try this ...
http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Ubuntu-12.04

http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_HortonWorks_VM

Regards,
arpit


On Thu, Apr 10, 2014 at 3:04 AM, Pradeep baji
wrote:

> Thanks Prabeesh.
>
>
> On Wed, Apr 9, 2014 at 12:37 AM, prabeesh k  wrote:
>
>> Please refer
>>
>> http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html
>>
>> Regards,
>> prabeesh
>>
>>
>> On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji > > wrote:
>>
>>> Hi all,
>>>
>>> I am new to spark and trying to learn it. Is there any document which
>>> describes how spark is packaged. ( like dependencies needed to build spark,
>>> which jar contains what after build etc)
>>>
>>> Thanks for the help.
>>>
>>> Regards,
>>> Pradeep
>>>
>>>
>>
>


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-16 Thread Arpit Tak
I am stuck on the same issue too, but with Shark (0.9, on Spark 0.9) on
hadoop-2.2.0.

On the other Hadoop versions it works perfectly.
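
For the sbt build, the equivalent of the pom.xml change Anant describes below
would be bumping the mesos version where it is declared in the build (a sketch;
in 0.9.x that is project/SparkBuild.scala) and re-running sbt/sbt assembly:

  "org.apache.mesos" % "mesos" % "0.18.0"   // was 0.13.0 in the stock 0.9 build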

Regards,
Arpit Tak


On Wed, Apr 16, 2014 at 11:18 PM, Aureliano Buendia wrote:

> Is this resolved in spark 0.9.1?
>
>
> On Tue, Apr 15, 2014 at 6:55 PM, anant  wrote:
>
>> I've received the same error with Spark built using Maven. It turns out
>> that
>> mesos-0.13.0 depends on protobuf-2.4.1 which is causing the clash at
>> runtime. Protobuf included by Akka is shaded and doesn't cause any
>> problems.
>>
>> The solution is to update the mesos dependency to 0.18.0 in spark's
>> pom.xml.
>> Rebuilding the JAR with this configuration solves the issue.
>>
>> -Anant
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p4286.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Shark: class java.io.IOException: Cannot run program "/bin/java"

2014-04-16 Thread Arpit Tak
Just set your JAVA_HOME properly:

export JAVA_HOME=/usr/lib/jvm/java-7-...   (something like this, whatever
version you have)

It should work.

Regards,
Arpit


On Wed, Apr 16, 2014 at 1:24 AM, ge ko  wrote:

> Hi,
>
>
>
> after starting the shark-shell
> via /opt/shark/shark-0.9.1/bin/shark-withinfo -skipRddReload I receive lots
> of output, including the exception that /bin/java cannot be executed. But
> it is linked to /usr/bin/java ?!?!
>
>
>
> root#>ls -al /bin/java
>
> lrwxrwxrwx 1 root root 13 15. Apr 21:45 /bin/java -> /usr/bin/java
>
> root#>/bin/java -version
>
> java version "1.7.0_51"
> OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
>
>
>
> Starting the shark shell:
>
>
>
> [root@hadoop-pg-5 bin]# /opt/shark/shark-0.9.1/bin/shark-withinfo
> -skipRddReload
> -hiveconf hive.root.logger=INFO,console -skipRddReload
> Starting the Shark Command Line Client
> 14/04/15 21:45:57 WARN conf.HiveConf: DEPRECATED: Configuration property
> hive.metastore.local no longer has any effect. Make sure to provide a valid
> value for hive.metastore.uris if you are connecting to a remote metastore.
> 14/04/15 21:45:58 WARN conf.HiveConf: DEPRECATED: Configuration property
> hive.metastore.local no longer has any effect. Make sure to provide a valid
> value for hive.metastore.uris if you are connecting to a remote metastore.
>
> Logging initialized using configuration in
> jar:file:/opt/shark/shark-0.9.1/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
> 14/04/15 21:45:58 INFO SessionState:
> Logging initialized using configuration in
> jar:file:/opt/shark/shark-0.9.1/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
> Hive history
> file=/tmp/root/hive_job_log_root_22574@hadoop-pg-5.cluster_201404152145_159664609.txt
> 14/04/15 21:45:58 INFO exec.HiveHistory: Hive history
> file=/tmp/root/hive_job_log_root_22574@hadoop-pg-5.cluster_201404152145_159664609.txt
> 14/04/15 21:45:58 WARN conf.HiveConf: DEPRECATED: Configuration property
> hive.metastore.local no longer has any effect. Make sure to provide a valid
> value for hive.metastore.uris if you are connecting to a remote metastore.
> 14/04/15 21:45:59 WARN conf.HiveConf: DEPRECATED: Configuration property
> hive.metastore.local no longer has any effect. Make sure to provide a valid
> value for hive.metastore.uris if you are connecting to a remote metastore.
> 14/04/15 21:46:00 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 14/04/15 21:46:00 INFO Remoting: Starting remoting
> 14/04/15 21:46:00 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@hadoop-pg-5.cluster:38835]
> 14/04/15 21:46:00 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@hadoop-pg-5.cluster:38835]
> 14/04/15 21:46:00 INFO spark.SparkEnv: Registering BlockManagerMaster
> 5,108: [GC 262656K->26899K(1005568K), 0,0409080 secs]
> 14/04/15 21:46:00 INFO storage.DiskBlockManager: Created local directory
> at /tmp/spark-local-20140415214600-9537
> 14/04/15 21:46:00 INFO storage.MemoryStore: MemoryStore started with
> capacity 589.2 MB.
> 14/04/15 21:46:00 INFO network.ConnectionManager: Bound socket to port
> 51889 with id = ConnectionManagerId(hadoop-pg-5.cluster,51889)
> 14/04/15 21:46:00 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 14/04/15 21:46:00 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Registering block manager hadoop-pg-5.cluster:51889 with 589.2 MB RAM
> 14/04/15 21:46:00 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/04/15 21:46:00 INFO spark.HttpServer: Starting HTTP Server
> 14/04/15 21:46:00 INFO server.Server: jetty-7.6.8.v20121106
> 14/04/15 21:46:00 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:59414
> 14/04/15 21:46:00 INFO broadcast.HttpBroadcast: Broadcast server started
> at http://10.147.210.5:59414
> 14/04/15 21:46:01 INFO spark.SparkEnv: Registering MapOutputTracker
> 14/04/15 21:46:01 INFO spark.HttpFileServer: HTTP File server directory is
> /tmp/spark-cf56ada9-d950-4abc-a1c3-76fecdc4faa3
> 14/04/15 21:46:01 INFO spark.HttpServer: Starting HTTP Server
> 14/04/15 21:46:01 INFO server.Server: jetty-7.6.8.v20121106
> 14/04/15 21:46:01 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:45689
> 14/04/15 21:46:01 INFO server.Server: jetty-7.6.8.v20121106
> 14/04/15 21:46:01 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage/rdd,null}
> 14/04/15 21:46:01 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage,null}
> 14/04/15 21:46:01 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/stage,null}
> 14/04/15 21:46:01 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/pool,null}
> 14/04/15 21:46:01 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages,null}
> 14/0

Create cache fails on first time

2014-04-16 Thread Arpit Tak
I am loading some data (25 GB) into Shark from HDFS (Spark and Shark, both 0.9).
It regularly happens that caching a table fails the very first time we cache
the data; the second time it runs successfully.

Is anybody else facing the same issue?

*Shark Client Log:*
> create table sample_cached as access_cached * from access;
[Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Time taken (including network latency): 68.016 seconds

*SharkServer Log:*
Moving data to: hdfs://ec2-xyz.compute-1.amazonaws.com:9000/user/hive/warehouse/access_cached
Failed with exception Unable to rename: hdfs://ec2-xyz.compute-1.amazonaws.com:9000/tmp/hive-root/hive_2014-04-16_10-52-10_487_3421016764043167178/-ext-10004 to: hdfs://ec2-xyz.compute-1.amazonaws.com:9000/user/hive/warehouse/access_cached
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask


Regards,
Arpit Tak


Re: Proper caching method

2014-04-16 Thread Arpit Tak
Thanks Cheng, that was helpful.


On Wed, Apr 16, 2014 at 1:29 PM, Cheng Lian  wrote:

> You can remove the cached rdd1 from the cache manager by calling
> rdd1.unpersist(). But there are some subtleties here: RDD.cache() is *lazy* while
> RDD.unpersist() is *eager*. When .cache() is called, it just tells the Spark
> runtime to cache the RDD *later*, when a corresponding job that uses this
> RDD is submitted; when .unpersist() is called, the cached RDD is removed
> immediately. So you may want to do something like this to avoid rdd1 taking
> too much memory:
>
> val rdd1 = sc.textFile(path).cache()
> val rdd2 = rdd1.filter(...).cache()
> val rdd3 = rdd1.filter(...).cache()
> // Trigger a job to materialize and cache rdd1, rdd2 & rdd3
> (rdd2 ++ rdd3).count()
> // Remove rdd1
> rdd1.unpersist()
> // Use rdd2 & rdd3 for later logics.
>
> In this way, an additional job is required so that you have a chance to
> evict rdd1 as early as possible.
>
>
> On Wed, Apr 16, 2014 at 2:43 PM, Arpit Tak wrote:
>
>> Hi Cheng,
>>
>> Is it possibe to delete or replicate an rdd ??
>>
>>
>> > rdd1 = textFile("hdfs...").cache()
>> >
>> > rdd2 = rdd1.filter(userDefinedFunc1).cache()
>> > rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>
>> I reframe above question , if rdd1 is around 50G and after filtering its
>> come around say 4G.
>> then to increase computing performance we just cached it .. but rdd2 and
>> rdd3 are on disk ..
>> so this will show somehow show good performance than performing filter on
>> disk , then caching rdd2 and rdd3.
>>
>> or can we also remove a particular rdd from cache say rdd1(if cached)
>> after filtered operation as its not required and we save memory usage.
>>
>> Regards,
>> Arpit
>>
>>
>> On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian wrote:
>>
>>> Hi Joe,
>>>
>>> You need to make sure which RDD is used most frequently. In your case,
>>> rdd2 & rdd3 are filtered result of rdd1, so usually they are relatively
>>> smaller than rdd1, and it would be more reasonable to cache rdd2 and/or
>>> rdd3 if rdd1 is not referenced elsewhere.
>>>
>>> Say rdd1 takes 10G, rdd2 takes 1G after filtering, if you cache both of
>>> them, you end up with 11G memory consumption, which might not be what you
>>> want.
>>>
>>> Regards
>>> Cheng
>>>
>>>
>>> On Mon, Apr 14, 2014 at 8:32 PM, Joe L  wrote:
>>>
>>>> Hi I am trying to cache 2Gbyte data and to implement the following
>>>> procedure.
>>>> In order to cache them I did as follows: Is it necessary to cache rdd2
>>>> since
>>>> rdd1 is already cached?
>>>>
>>>> rdd1 = textFile("hdfs...").cache()
>>>>
>>>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>>>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>


Re: Proper caching method

2014-04-15 Thread Arpit Tak
Hi Cheng,

Is it possible to delete or replicate an RDD?

> rdd1 = textFile("hdfs...").cache()
>
> rdd2 = rdd1.filter(userDefinedFunc1).cache()
> rdd3 = rdd1.filter(userDefinedFunc2).cache()

To reframe the question above: if rdd1 is around 50 GB and after filtering it
comes down to, say, 4 GB, then to increase computing performance we just cache
the filtered result, while rdd2 and rdd3 stay on disk.
Will this somehow show better performance than performing the filter from
disk and then caching rdd2 and rdd3?

Also, can we remove a particular RDD from the cache, say rdd1 (if cached), after
the filter operation, since it is not required any more, and so save memory?

Regards,
Arpit


On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian  wrote:

> Hi Joe,
>
> You need to make sure which RDD is used most frequently. In your case,
> rdd2 & rdd3 are filtered result of rdd1, so usually they are relatively
> smaller than rdd1, and it would be more reasonable to cache rdd2 and/or
> rdd3 if rdd1 is not referenced elsewhere.
>
> Say rdd1 takes 10G, rdd2 takes 1G after filtering, if you cache both of
> them, you end up with 11G memory consumption, which might not be what you
> want.
>
> Regards
> Cheng
>
>
> On Mon, Apr 14, 2014 at 8:32 PM, Joe L  wrote:
>
>> Hi I am trying to cache 2Gbyte data and to implement the following
>> procedure.
>> In order to cache them I did as follows: Is it necessary to cache rdd2
>> since
>> rdd1 is already cached?
>>
>> rdd1 = textFile("hdfs...").cache()
>>
>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Spark resilience

2014-04-15 Thread Arpit Tak
1. If we add more executors to the cluster and the data is already cached in the
existing ones (the RDDs are already there), will the job run on the new executors
or not, given that the RDD blocks are not present there?
If yes, how is the performance on the new executors?

2. What is the in-memory replication factor in Spark (for Hadoop the default
is 3), and can we change it for Spark as well?
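
(On question 2: each cached block has a single copy by default; a replicated
storage level can be requested per RDD. A minimal sketch against the 0.9 API,
with an example path:)

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs://localhost:54310/t/README.md")   // example path
data.persist(StorageLevel.MEMORY_ONLY_2)   // keep two in-memory copies of each partition
data.count()                               // materializes the replicated cache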




On Tue, Apr 15, 2014 at 9:53 PM, Manoj Samel wrote:

> Thanks Aaron, this is useful !
>
> - Manoj
>
>
> On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson wrote:
>
>> Launching drivers inside the cluster was a feature added in 0.9, for
>> standalone cluster mode:
>> http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
>>
>> Note the "supervise" flag, which will cause the driver to be restarted if
>> it fails. This is a rather low-level mechanism which by default will just
>> cause the whole job to rerun from the beginning. Special recovery would
>> have to be implemented by hand, via some sort of state checkpointing into a
>> globally visible storage system (e.g., HDFS), which, for example, Spark
>> Streaming already does.
>>
>> Currently, this feature is not supported in YARN or Mesos fine-grained
>> mode.
>>
>>
>> On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel wrote:
>>
>>> Could you please elaborate how drivers can be restarted automatically ?
>>>
>>> Thanks,
>>>
>>>
>>> On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson wrote:
>>>
 Master and slave are somewhat overloaded terms in the Spark ecosystem
 (see the glossary:
 http://spark.apache.org/docs/latest/cluster-overview.html#glossary).
 Are you actually asking about the Spark "driver" and "executors", or the
 standalone cluster "master" and "workers"?

 To briefly answer for either possibility:
 (1) Drivers are not fault tolerant but can be restarted automatically,
 Executors may be removed at any point without failing the job (though
 losing an Executor may slow the job significantly), and Executors may be
 added at any point and will be immediately used.
 (2) Standalone cluster Masters are fault tolerant and failure will only
 temporarily stall new jobs from starting or getting new resources, but does
 not affect currently-running jobs. Workers can fail and will simply cause
 jobs to lose their current Executors. New Workers can be added at any 
 point.



 On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira >>> > wrote:

> Folks,
>
> I was wondering what the failure support modes where for Spark while
> running jobs
>
>
>1. What happens when a master fails
>2. What happens when a slave fails
>3. Can you mid job add and remove slaves
>
>
> Regarding the install on Meso, if I understand correctly the Spark
> master is behind a Zookeeper quorum so that isolates the slaves from a
> master failure, but what about the masters behind quorum?
>
> Cheers
> - Ian
>
>

>>>
>>
>