Number of SparkContexts per JVM
Hi, As far as I know you can create one SparkContext per JVM, but I wanted to confirm whether it's one per JVM or one per classloader - as in, one SparkContext created per *.war, with all deployments under one Tomcat instance. Regards, Praveen
Re: Spark Streaming with Kafka Use Case
Sorry.. Rephrasing : Can this issue be resolved by having a smaller block interval? Regards, Praveen On 18 Feb 2016 21:30, "praveen S" <mylogi...@gmail.com> wrote: > Can having a smaller block interval only resolve this? > > Regards, > Praveen > On 18 Feb 2016 21:13, "Cody Koeninger" <c...@koeninger.org> wrote: > >> Backpressure won't help you with the first batch, you'd need >> spark.streaming.kafka.maxRatePerPartition >> for that >> >> On Thu, Feb 18, 2016 at 9:40 AM, praveen S <mylogi...@gmail.com> wrote: >> >>> Have a look at >>> >>> spark.streaming.backpressure.enabled >>> Property >>> >>> Regards, >>> Praveen >>> On 18 Feb 2016 00:13, "Abhishek Anand" <abhis.anan...@gmail.com> wrote: >>> >>>> I have a spark streaming application running in production. I am trying >>>> to find a solution for a particular use case when my application has a >>>> downtime of say 5 hours and is restarted. Now, when I start my streaming >>>> application after 5 hours there would be considerable amount of data then >>>> in the Kafka and my cluster would be unable to repartition and process >>>> that. >>>> >>>> Is there any workaround so that when my streaming application starts it >>>> starts taking data for 1-2 hours, process it , then take the data for next >>>> 1 hour process it. Now when its done processing of previous 5 hours data >>>> which missed, normal streaming should start with the given slide interval. >>>> >>>> Please suggest any ideas and feasibility of this. >>>> >>>> >>>> Thanks !! >>>> Abhi >>>> >>> >>
Re: Spark Streaming with Kafka Use Case
Can having a smaller block interval only resolve this? Regards, Praveen On 18 Feb 2016 21:13, "Cody Koeninger" <c...@koeninger.org> wrote: > Backpressure won't help you with the first batch, you'd need > spark.streaming.kafka.maxRatePerPartition > for that > > On Thu, Feb 18, 2016 at 9:40 AM, praveen S <mylogi...@gmail.com> wrote: > >> Have a look at >> >> spark.streaming.backpressure.enabled >> Property >> >> Regards, >> Praveen >> On 18 Feb 2016 00:13, "Abhishek Anand" <abhis.anan...@gmail.com> wrote: >> >>> I have a spark streaming application running in production. I am trying >>> to find a solution for a particular use case when my application has a >>> downtime of say 5 hours and is restarted. Now, when I start my streaming >>> application after 5 hours there would be considerable amount of data then >>> in the Kafka and my cluster would be unable to repartition and process that. >>> >>> Is there any workaround so that when my streaming application starts it >>> starts taking data for 1-2 hours, process it , then take the data for next >>> 1 hour process it. Now when its done processing of previous 5 hours data >>> which missed, normal streaming should start with the given slide interval. >>> >>> Please suggest any ideas and feasibility of this. >>> >>> >>> Thanks !! >>> Abhi >>> >> >
Re: Spark Streaming with Kafka Use Case
Have a look at the spark.streaming.backpressure.enabled property. Regards, Praveen On 18 Feb 2016 00:13, "Abhishek Anand" wrote: > I have a spark streaming application running in production. I am trying to > find a solution for a particular use case when my application has a > downtime of say 5 hours and is restarted. Now, when I start my streaming > application after 5 hours there would be considerable amount of data then > in the Kafka and my cluster would be unable to repartition and process that. > > Is there any workaround so that when my streaming application starts it > starts taking data for 1-2 hours, process it , then take the data for next > 1 hour process it. Now when its done processing of previous 5 hours data > which missed, normal streaming should start with the given slide interval. > > Please suggest any ideas and feasibility of this. > > > Thanks !! > Abhi >
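The two settings discussed in this thread work together: the rate cap bounds the oversized first batch after a restart, and backpressure adapts later batches. A minimal sketch (application name, rate, and batch interval are illustrative choices, not recommendations):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("KafkaStreamingApp")
  // Caps consumption at 10000 records per Kafka partition per second,
  // which also bounds the very first batch after downtime.
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // Lets Spark adapt the ingestion rate of subsequent batches to the
  // observed processing speed.
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(10))
```

With both set, the backlog is drained in bounded chunks instead of one giant batch, which is essentially the behaviour Abhishek asked for.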
Re: Best practises of share Spark cluster over few applications
Even I was trying to launch Spark jobs from a web service, but I thought you could run Spark jobs in YARN mode only through spark-submit. Is my understanding correct? Regards, Praveen On 15 Feb 2016 08:29, "Sabarish Sasidharan" wrote: > Yes you can look at using the capacity scheduler or the fair scheduler > with YARN. Both allow using the full cluster when idle. And both allow > considering cpu plus memory when allocating resources, which is sort of > necessary with Spark. > > Regards > Sab > On 13-Feb-2016 10:11 pm, "Eugene Morozov" > wrote: > >> Hi, >> >> I have several instances of the same web-service that is running some ML >> algos on Spark (both training and prediction) and do some Spark unrelated >> job. Each web-service instance creates their own JavaSparkContext, thus >> they're seen as separate applications by Spark, thus they're configured >> with separate limits of resources such as cores (I'm not concerned about >> the memory as much as about cores). >> >> With this set up, say 3 web service instances, each of them has just 1/3 >> of cores. But it might happen, than only one instance is going to use >> Spark, while others are busy with Spark unrelated. I'd like in this case >> all Spark cores be available for the one that's in need. >> >> Ideally I'd like Spark cores just be available in total and the first app >> who needs it, takes as much as required from the available at the moment. >> Is it possible? I believe Mesos is able to set resources free if they're >> not in use. Is it possible with YARN? >> >> I'd appreciate if you could share your thoughts or experience on the >> subject. >> >> Thanks. >> -- >> Be well! >> Jean Morozov >> >
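To sketch Sab's suggestion concretely: a YARN Fair Scheduler allocation file can give each web-service instance its own queue while still letting any queue expand into idle cluster capacity. Queue names and resource figures below are hypothetical:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- One queue per web-service instance. minResources is guaranteed;
       with no maxResources set, a busy queue can borrow the idle
       capacity of the others and is preempted back when they need it. -->
  <queue name="webservice1">
    <minResources>4096 mb,2 vcores</minResources>
    <weight>1.0</weight>
  </queue>
  <queue name="webservice2">
    <minResources>4096 mb,2 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>
```

Each application would then submit with its queue set (e.g. the spark.yarn.queue property), and the scheduler, rather than fixed per-SparkContext core limits, handles sharing.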
AM creation in yarn client mode
Hi, I have 2 questions when running Spark jobs on YARN in client mode: 1) Where is the AM (ApplicationMaster) created: A) Is it created on the client where the job was submitted, i.e. driver and AM on the same client? Or B) Does YARN decide where the AM should be created? 2) Driver and AM run in different processes: is my assumption correct? Regards, Praveen
Re: AM creation in yarn-client mode
Can you explain what happens in yarn client mode? Regards, Praveen On 10 Feb 2016 10:55, "ayan guha" <guha.a...@gmail.com> wrote: > It depends on yarn-cluster and yarn-client mode. > > On Wed, Feb 10, 2016 at 3:42 PM, praveen S <mylogi...@gmail.com> wrote: > >> Hi, >> >> I have 2 questions when running the spark jobs on yarn in client mode : >> >> 1) Where is the AM(application master) created : >> >> A) is it created on the client where the job was submitted? i.e driver >> and AM on the same client? >> Or >> B) yarn decides where the the AM should be created? >> >> 2) Driver and AM run in different processes : is my assumption correct? >> >> Regards, >> Praveen >> > > > > -- > Best Regards, > Ayan Guha >
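To make the distinction Ayan refers to concrete, here is a sketch of the two submission modes as they were invoked in Spark 1.x (jar and class names hypothetical; newer versions spell these `--master yarn --deploy-mode client|cluster`):

```shell
# yarn-client mode: the driver runs in this local JVM on the client
# machine; YARN still picks a cluster node for the (separate, lightweight)
# ApplicationMaster container, which only negotiates executor resources.
spark-submit --master yarn-client --class com.example.MyApp my-app.jar

# yarn-cluster mode: the driver itself runs inside the ApplicationMaster
# container on a node that YARN chooses.
spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar
```

So in client mode the answer to Praveen's questions is: the driver stays on the client, YARN decides where the AM runs, and yes, they are different processes.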
Re: Create a n x n graph given only the vertices no
Hi Robin, I am using Spark 1.3 and I am not able to find the api Graph.fromEdgeTuples(edge RDD, 1) Regards, Praveen Well you can use a similar tech to generate an RDD[(Long, Long)] (that’s what the edges variable is) and then create the Graph using Graph.fromEdgeTuples. --- Robin East *Spark GraphX in Action* Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action On 11 Jan 2016, at 12:30, praveen S <mylogi...@gmail.com> wrote: Yes I was looking something of that sort.. Thank you. Actually I was looking for a way to connect nodes based on the property of the nodes.. I have a set of nodes and I know the condition on which I can create an edge.. On 11 Jan 2016 14:06, "Robin East" <robin.e...@xense.co.uk> wrote: > Do you mean to create a perfect graph? You can do it like this: > > scala> import org.apache.spark.graphx._ > import org.apache.spark.graphx._ > > scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L)) > vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at > parallelize at :36 > > scala> val edges = vert.cartesian(vert) > edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at > cartesian at :38 > > scala> val g = Graph.fromEdgeTuples(edges, 1) > > > That will give you a graph where all vertices connect to every other > vertices. it will give you self joins as well i.e. vert 1 has an edge to > vert 1, you can deal with this using a filter on the edges RDD if you don’t > want self-joins > > -- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 11 Jan 2016, at 03:19, praveen S <mylogi...@gmail.com> wrote: > > Is it possible in graphx to create/generate graph of n x n given only the > vertices. > On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote: > >> Is it possible in graphx to create/generate a graph n x n given n >> vertices? >> > >
Re: Create a n x n graph given only the vertices no
Sorry.. Found the api.. On 21 Jan 2016 10:17, "praveen S" <mylogi...@gmail.com> wrote: > Hi Robin, > > I am using Spark 1.3 and I am not able to find the api > Graph.fromEdgeTuples(edge RDD, 1) > > Regards, > Praveen > Well you can use a similar tech to generate an RDD[(Long, Long)] (that’s > what the edges variable is) and then create the Graph using > Graph.fromEdgeTuples. > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 11 Jan 2016, at 12:30, praveen S <mylogi...@gmail.com> wrote: > > Yes I was looking something of that sort.. Thank you. > > Actually I was looking for a way to connect nodes based on the property of > the nodes.. I have a set of nodes and I know the condition on which I can > create an edge.. > On 11 Jan 2016 14:06, "Robin East" <robin.e...@xense.co.uk> wrote: > >> Do you mean to create a perfect graph? You can do it like this: >> >> scala> import org.apache.spark.graphx._ >> import org.apache.spark.graphx._ >> >> scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L)) >> vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at >> parallelize at :36 >> >> scala> val edges = vert.cartesian(vert) >> edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at >> cartesian at :38 >> >> scala> val g = Graph.fromEdgeTuples(edges, 1) >> >> >> That will give you a graph where all vertices connect to every other >> vertices. it will give you self joins as well i.e. vert 1 has an edge to >> vert 1, you can deal with this using a filter on the edges RDD if you don’t >> want self-joins >> >> ------ >> Robin East >> *Spark GraphX in Action* Michael Malak and Robin East >> Manning Publications Co. 
>> http://www.manning.com/books/spark-graphx-in-action >> >> >> >> >> >> On 11 Jan 2016, at 03:19, praveen S <mylogi...@gmail.com> wrote: >> >> Is it possible in graphx to create/generate graph of n x n given only the >> vertices. >> On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote: >> >>> Is it possible in graphx to create/generate a graph n x n given n >>> vertices? >>> >> >> >
Re: Reuse Executor JVM across different JobContext
Can you give me more details on Spark's JobServer? Regards, Praveen On 18 Jan 2016 03:30, "Jia" wrote: > I guess all jobs submitted through JobServer are executed in the same JVM, > so RDDs cached by one job can be visible to all other jobs executed later. > On Jan 17, 2016, at 3:56 PM, Mark Hamstra wrote: > > Yes, that is one of the basic reasons to use a > jobserver/shared-SparkContext. Otherwise, in order to share the data in an > RDD you have to use an external storage system, such as a distributed > filesystem or Tachyon. > > On Sun, Jan 17, 2016 at 1:52 PM, Jia wrote: > >> Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem, >> so that jobs can be submitted at different times and still share RDDs. >> >> Best Regards, >> Jia >> >> >> On Jan 17, 2016, at 3:44 PM, Mark Hamstra >> wrote: >> >> There is a 1-to-1 relationship between Spark Applications and >> SparkContexts -- fundamentally, a Spark Application is a program that >> creates and uses a SparkContext, and that SparkContext is destroyed when >> the Application ends. A jobserver generically, and the Spark JobServer >> specifically, is an Application that keeps a SparkContext open for a long >> time and allows many Jobs to be submitted and run using that shared >> SparkContext. >> >> More than one Application/SparkContext unavoidably implies more than one >> JVM process per Worker -- Applications/SparkContexts cannot share JVM >> processes. >> >> On Sun, Jan 17, 2016 at 1:15 PM, Jia wrote: >> >>> Hi, Mark, sorry for the confusion. >>> >>> Let me clarify: when an application is submitted, the master will tell >>> each Spark worker to spawn an executor JVM process. All the task sets of >>> the application will be executed by the executor. After the application >>> runs to completion, the executor process will be killed. >>> But I hope that all applications submitted can run in the same executor; >>> can JobServer do that? If so, it's really good news! 
>>> >>> Best Regards, >>> Jia >>> >>> On Jan 17, 2016, at 3:09 PM, Mark Hamstra >>> wrote: >>> >>> You've still got me confused. The SparkContext exists at the Driver, >>> not on an Executor. >>> >>> Many Jobs can be run by a SparkContext -- it is a common pattern to use >>> something like the Spark Jobserver where all Jobs are run through a shared >>> SparkContext. >>> >>> On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou >>> wrote: >>> Hi, Mark, sorry, I mean SparkContext. I mean to change Spark into running all submitted jobs (SparkContexts) in one executor JVM. Best Regards, Jia On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra wrote: > -dev > > What do you mean by JobContext? That is a Hadoop mapreduce concept, > not Spark. > > On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou > wrote: > >> Dear all, >> >> Is there a way to reuse executor JVM across different JobContexts? >> Thanks. >> >> Best Regards, >> Jia >> > > >>> >>> >> >> > >
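For Praveen's question, the Spark JobServer exposes a small REST API over a long-lived, shared SparkContext. A sketch of typical usage (host, port, jar, and class names are hypothetical; the project's README is the authoritative reference for the endpoints):

```shell
# Upload an application jar under an app name.
curl --data-binary @my-job.jar localhost:8090/jars/myapp

# Create a long-lived, named SparkContext shared across jobs.
curl -d "" 'localhost:8090/contexts/shared-context?num-cpu-cores=4&memory-per-node=2g'

# Run jobs against the shared context; because they execute in the same
# driver/executor JVMs, later jobs can reuse RDDs cached by earlier ones.
curl -d "input=hdfs:///data/input" \
  'localhost:8090/jobs?appName=myapp&classPath=com.example.MyJob&context=shared-context'
```

This matches Mark's point above: it is one Application with one SparkContext, inside which many Jobs run, not a way to make separate Applications share executor JVMs.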
Usage of SparkContext within a Web container
Is using a SparkContext from a web container the right way to process Spark jobs, or should we use spark-submit via a ProcessBuilder? Are there any pros or cons of using a SparkContext from a web container? How does Zeppelin trigger Spark jobs from the web context?
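One alternative to a raw ProcessBuilder worth noting: since Spark 1.4 the spark-launcher module wraps spark-submit programmatically, so the web container never holds a SparkContext (and its resources) itself. A sketch, with paths and class names hypothetical:

```scala
import org.apache.spark.launcher.SparkLauncher

// Spawns a child JVM running spark-submit; the driver lives in that
// child process, not in the servlet container.
val sparkProcess: Process = new SparkLauncher()
  .setSparkHome("/opt/spark")
  .setAppResource("/opt/jobs/my-job.jar")
  .setMainClass("com.example.MyJob")
  .setMaster("yarn-client")
  .launch()

// The request-handling thread can block for completion, or fire and forget.
sparkProcess.waitFor()
```

The trade-off is roughly: an in-container SparkContext gives low-latency job submission and RDD reuse, but ties Spark's lifecycle (and any crash) to the web app; launching spark-submit isolates the driver at the cost of per-job startup time.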
Re: Create a n x n graph given only the vertices no
Yes I was looking something of that sort.. Thank you. Actually I was looking for a way to connect nodes based on the property of the nodes.. I have a set of nodes and I know the condition on which I can create an edge.. On 11 Jan 2016 14:06, "Robin East" <robin.e...@xense.co.uk> wrote: > Do you mean to create a perfect graph? You can do it like this: > > scala> import org.apache.spark.graphx._ > import org.apache.spark.graphx._ > > scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L)) > vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at > parallelize at :36 > > scala> val edges = vert.cartesian(vert) > edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at > cartesian at :38 > > scala> val g = Graph.fromEdgeTuples(edges, 1) > > > That will give you a graph where all vertices connect to every other > vertices. it will give you self joins as well i.e. vert 1 has an edge to > vert 1, you can deal with this using a filter on the edges RDD if you don’t > want self-joins > > -- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 11 Jan 2016, at 03:19, praveen S <mylogi...@gmail.com> wrote: > > Is it possible in graphx to create/generate graph of n x n given only the > vertices. > On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote: > >> Is it possible in graphx to create/generate a graph n x n given n >> vertices? >> > >
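Robin's suggestion to drop the self-loops can be sketched as a one-line filter on the edge tuples before building the graph:

```scala
import org.apache.spark.graphx._

val vert = sc.parallelize(Seq(1L, 2L, 3L, 4L, 5L))

// cartesian gives every ordered pair of vertices, including (v, v);
// the filter removes those self-loops before the graph is built.
val edges = vert.cartesian(vert)
  .filter { case (src, dst) => src != dst }

val g = Graph.fromEdgeTuples(edges, 1)
```

The same filter is also the place to apply an arbitrary predicate on vertex pairs, which is what Praveen describes (connecting nodes based on a condition over their properties).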
Re: Create a n x n graph given only the vertices no
Is it possible in graphx to create/generate graph of n x n given only the vertices. On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote: > Is it possible in graphx to create/generate a graph n x n given n > vertices? >
Create a n x n graph given only the vertices
Is it possible in GraphX to create/generate an n x n graph given n vertices?
Regarding rdd.collect()
When I do an rdd.collect(), does the data move back to the driver, or is it still held in memory across the executors?
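collect() materializes the entire RDD as a local array in the driver JVM; any partitions the executors had cached (via persist/cache) remain where they were, but the collected copy lives on the driver. A short sketch:

```scala
val rdd = sc.parallelize(1 to 1000000)

// Ships every partition's elements back to the driver as one
// Array[Int]; for large datasets this can exhaust driver memory.
val local: Array[Int] = rdd.collect()

// When only a summary is needed, prefer actions that aggregate on the
// executors and return a small result to the driver.
val total = rdd.sum()
```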
Meaning of local[2]
What does local[2] mean in .setMaster("local[2]")? Is this applicable only for local/standalone mode? Can I do this in a cluster setup, e.g. .setMaster("hostname:port[2]")? Is it the number of threads per worker node?
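A sketch of the two master URL forms (host and port hypothetical): local[2] means "run the whole application in this one JVM with 2 worker threads", so it applies only to local mode. Cluster master URLs take no thread suffix; per-executor parallelism is configured separately:

```scala
import org.apache.spark.SparkConf

// Local mode: driver and "executors" share one JVM, using 2 threads.
val localConf = new SparkConf().setMaster("local[2]")

// Standalone cluster: no [n] suffix on the URL; parallelism comes
// from settings such as spark.executor.cores instead.
val clusterConf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .set("spark.executor.cores", "2")
```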
StringIndexer + VectorAssembler equivalent to HashingTF?
Is StringIndexer + VectorAssembler equivalent to HashingTF when converting documents for analysis?
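They are not equivalent: HashingTF maps a bag of tokens to a fixed-length term-frequency vector by hashing, while StringIndexer encodes a single categorical column as one numeric index (and VectorAssembler merely concatenates existing columns). A sketch of the text path (column names hypothetical):

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split each document into tokens.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Each document becomes a sparse term-frequency vector whose dimension
// is fixed up front, independent of the vocabulary size.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)
```

StringIndexer + VectorAssembler, by contrast, would give one index per whole string value, which is only sensible for categorical features, not free text.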
Re: Spark MLlib v/s SparkR
I am starting off with classification models (Logistic Regression, Random Forest). Basically I wanted to learn machine learning. Since I have a Java background I started off with MLlib, but later heard R works as well (with scaling issues only). So I was wondering whether SparkR would resolve the scaling issue - hence my question: why not go with R and SparkR alone? (keeping aside my inclination towards Java) On Thu, Aug 6, 2015 at 12:28 AM, Charles Earl <charles.ce...@gmail.com> wrote: > What machine learning algorithms are you interested in exploring or using? > Start from there, or better yet the problem you are trying to solve, and > then the selection may be evident. > > On Wednesday, August 5, 2015, praveen S <mylogi...@gmail.com> wrote: > >> I was wondering when one should go for MLlib or SparkR. What are the >> criteria, or what should be considered, before choosing either solution >> for data analysis? What are the advantages of Spark MLlib over SparkR, or >> of SparkR over MLlib? >> > > -- > - Charles
Spark MLlib v/s SparkR
I was wondering when one should go for MLlib or SparkR. What are the criteria, or what should be considered, before choosing either solution for data analysis? What are the advantages of Spark MLlib over SparkR, or of SparkR over MLlib?
Difference between RandomForestModel and RandomForestClassificationModel
Hi, I wanted to know the difference between RandomForestModel and RandomForestClassificationModel in MLlib. Will they yield the same results for a given dataset?
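The two classes come from Spark's two API generations: RandomForestModel is produced by the RDD-based org.apache.spark.mllib.tree.RandomForest, while RandomForestClassificationModel is produced by the DataFrame-based org.apache.spark.ml.classification.RandomForestClassifier. The underlying algorithm is the same, so with matching parameters and seed the results should correspond. A sketch of the two entry points (the data variables labeledPointsRdd and trainingDf are hypothetical):

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.ml.classification.RandomForestClassifier

// RDD-based API (spark.mllib): trains on RDD[LabeledPoint] and
// returns a RandomForestModel.
val mllibModel = RandomForest.trainClassifier(
  labeledPointsRdd,
  numClasses = 2,
  categoricalFeaturesInfo = Map.empty[Int, Int],
  numTrees = 20,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)

// DataFrame-based API (spark.ml): fits on a DataFrame with "label"
// and "features" columns and returns a RandomForestClassificationModel.
val mlModel = new RandomForestClassifier()
  .setNumTrees(20)
  .setMaxDepth(5)
  .setSeed(42L)
  .fit(trainingDf)
```

Note that random feature/sample selection means runs with different seeds can differ slightly even within one API.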
SparkR and Spark MLlib
Hi, Are SparkR and Spark MLlib the same thing?