Number of SparkContexts per JVM

2016-05-09 Thread praveen S
Hi,

As far as I know you can create one SparkContext per JVM, but I wanted to
confirm whether it is one per JVM or one per classloader, i.e. one
SparkContext created per *.war, with all deployments under one Tomcat instance.

Regards,
Praveen
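
For illustration, a minimal sketch (not part of the original thread) of the documented one-active-SparkContext-per-JVM rule, assuming Spark 1.4+ in local mode:

import org.apache.spark.{SparkConf, SparkContext}

// First context in this JVM: fine.
val sc = new SparkContext(
  new SparkConf().setAppName("app-a").setMaster("local[2]"))

// A second `new SparkContext(...)` in the same JVM throws an exception
// (spark.driver.allowMultipleContexts=true downgrades it to a warning, but
// multiple contexts per JVM are not officially supported).
// The safe pattern is to share the one context:
val same = SparkContext.getOrCreate()   // returns the context created above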


Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Sorry, rephrasing:
Can this issue be resolved by having a smaller block interval?

Regards,
Praveen
On 18 Feb 2016 21:30, "praveen S" <mylogi...@gmail.com> wrote:

> Can having a smaller block interval only resolve this?
>
> Regards,
> Praveen
> On 18 Feb 2016 21:13, "Cody Koeninger" <c...@koeninger.org> wrote:
>
>> Backpressure won't help you with the first batch, you'd need 
>> spark.streaming.kafka.maxRatePerPartition
>> for that
>>
>> On Thu, Feb 18, 2016 at 9:40 AM, praveen S <mylogi...@gmail.com> wrote:
>>
>>> Have a look at
>>>
>>> spark.streaming.backpressure.enabled
>>> Property
>>>
>>> Regards,
>>> Praveen
>>> On 18 Feb 2016 00:13, "Abhishek Anand" <abhis.anan...@gmail.com> wrote:
>>>
>>>> I have a Spark Streaming application running in production. I am trying
>>>> to find a solution for a particular use case: my application has a
>>>> downtime of, say, 5 hours and is then restarted. Now, when I start my
>>>> streaming application after those 5 hours, there would be a considerable
>>>> amount of data in Kafka, and my cluster would be unable to repartition
>>>> and process all of it at once.
>>>>
>>>> Is there any workaround so that when my streaming application starts it
>>>> takes data for 1-2 hours and processes it, then takes the data for the
>>>> next hour and processes it? Once it has processed the 5 hours of data it
>>>> missed, normal streaming should resume with the given slide interval.
>>>>
>>>> Please suggest any ideas and the feasibility of this.
>>>>
>>>>
>>>> Thanks !!
>>>> Abhi
>>>>
>>>
>>
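
For illustration, a minimal sketch (not from the thread) of the two settings Cody and Praveen mention, assuming Spark 1.5 or later with the direct Kafka stream; the broker, topic and numbers are made up:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("kafka-catchup-demo")
  // Caps every batch, including the first catch-up batch after downtime:
  // maximum records per second read from each Kafka partition.
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // Lets Spark throttle later batches based on observed scheduling delay.
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(30))   // illustrative batch interval
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
// ... build the rest of the streaming job on `stream` ...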


Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Can having a smaller block interval only resolve this?

Regards,
Praveen
On 18 Feb 2016 21:13, "Cody Koeninger" <c...@koeninger.org> wrote:

> Backpressure won't help you with the first batch, you'd need 
> spark.streaming.kafka.maxRatePerPartition
> for that
>
> On Thu, Feb 18, 2016 at 9:40 AM, praveen S <mylogi...@gmail.com> wrote:
>
>> Have a look at
>>
>> spark.streaming.backpressure.enabled
>> Property
>>
>> Regards,
>> Praveen
>> On 18 Feb 2016 00:13, "Abhishek Anand" <abhis.anan...@gmail.com> wrote:
>>
>>> I have a Spark Streaming application running in production. I am trying
>>> to find a solution for a particular use case: my application has a
>>> downtime of, say, 5 hours and is then restarted. Now, when I start my
>>> streaming application after those 5 hours, there would be a considerable
>>> amount of data in Kafka, and my cluster would be unable to repartition
>>> and process all of it at once.
>>>
>>> Is there any workaround so that when my streaming application starts it
>>> takes data for 1-2 hours and processes it, then takes the data for the
>>> next hour and processes it? Once it has processed the 5 hours of data it
>>> missed, normal streaming should resume with the given slide interval.
>>>
>>> Please suggest any ideas and the feasibility of this.
>>>
>>>
>>> Thanks !!
>>> Abhi
>>>
>>
>


Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Have a look at the

spark.streaming.backpressure.enabled
property.

Regards,
Praveen
On 18 Feb 2016 00:13, "Abhishek Anand"  wrote:

> I have a Spark Streaming application running in production. I am trying
> to find a solution for a particular use case: my application has a
> downtime of, say, 5 hours and is then restarted. Now, when I start my
> streaming application after those 5 hours, there would be a considerable
> amount of data in Kafka, and my cluster would be unable to repartition
> and process all of it at once.
>
> Is there any workaround so that when my streaming application starts it
> takes data for 1-2 hours and processes it, then takes the data for the
> next hour and processes it? Once it has processed the 5 hours of data it
> missed, normal streaming should resume with the given slide interval.
>
> Please suggest any ideas and the feasibility of this.
>
>
> Thanks !!
> Abhi
>


Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread praveen S
I was also trying to launch Spark jobs from a web service.

But I thought you could run Spark jobs in YARN mode only through
spark-submit. Is my understanding incorrect?

Regards,
Praveen
On 15 Feb 2016 08:29, "Sabarish Sasidharan" 
wrote:

> Yes, you can look at using the capacity scheduler or the fair scheduler
> with YARN. Both allow using the full cluster when it is idle, and both can
> take CPU as well as memory into account when allocating resources, which is
> more or less necessary with Spark.
>
> Regards
> Sab
> On 13-Feb-2016 10:11 pm, "Eugene Morozov" 
> wrote:
>
>> Hi,
>>
>> I have several instances of the same web service, each running some ML
>> algorithms on Spark (both training and prediction) plus some Spark-unrelated
>> work. Each web-service instance creates its own JavaSparkContext, so they
>> are seen as separate applications by Spark and are therefore configured with
>> separate limits on resources such as cores (I'm not as concerned about
>> memory as I am about cores).
>>
>> With this setup, say 3 web-service instances, each of them gets just 1/3 of
>> the cores. But it might happen that only one instance is using Spark while
>> the others are busy with Spark-unrelated work. In that case I'd like all
>> Spark cores to be available to the one that needs them.
>>
>> Ideally I'd like the Spark cores to be available as one shared pool, with the
>> first app that needs them taking as much as required from whatever is
>> available at the moment. Is that possible? I believe Mesos is able to free
>> resources when they're not in use. Is it possible with YARN?
>>
>> I'd appreciate it if you could share your thoughts or experience on the
>> subject.
>>
>> Thanks.
>> --
>> Be well!
>> Jean Morozov
>>
>
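
For illustration (not from the thread): assuming YARN is set up with the capacity or fair scheduler as Sab describes, each application only needs to point at its queue; the queue name below is made up. Programmatic submission without the spark-submit script is also possible via org.apache.spark.launcher.SparkLauncher (Spark 1.4+), sketched under "Usage of SparkContext within a Web container" later in this digest.

import org.apache.spark.{SparkConf, SparkContext}

// Each web-service instance submits into a YARN queue; an idle queue's share
// can then be borrowed by a busy one and reclaimed later by the scheduler.
val conf = new SparkConf()
  .setAppName("ml-service-1")
  .setMaster("yarn-client")
  .set("spark.yarn.queue", "mlQueue")     // hypothetical queue name
  .set("spark.executor.cores", "2")       // cores count towards allocation when
  .set("spark.executor.memory", "4g")     // YARN uses the DominantResourceCalculator

val sc = new SparkContext(conf)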


AM creation in yarn client mode

2016-02-09 Thread praveen S
Hi,

I have 2 questions about running Spark jobs on YARN in client mode:

1) Where is the AM (ApplicationMaster) created?

A) Is it created on the client where the job was submitted, i.e. driver and
AM on the same client?
Or
B) Does YARN decide where the AM should be created?

2) The driver and the AM run in different processes: is my assumption correct?

Regards,
Praveen


Re: AM creation in yarn-client mode

2016-02-09 Thread praveen S
Can you explain what happens in yarn client mode?

Regards,
Praveen
On 10 Feb 2016 10:55, "ayan guha" <guha.a...@gmail.com> wrote:

> It depends on yarn-cluster and yarn-client mode.
>
> On Wed, Feb 10, 2016 at 3:42 PM, praveen S <mylogi...@gmail.com> wrote:
>
>> Hi,
>>
>> I have 2 questions about running Spark jobs on YARN in client mode:
>>
>> 1) Where is the AM (ApplicationMaster) created?
>>
>> A) Is it created on the client where the job was submitted, i.e. driver
>> and AM on the same client?
>> Or
>> B) Does YARN decide where the AM should be created?
>>
>> 2) The driver and the AM run in different processes: is my assumption correct?
>>
>> Regards,
>> Praveen
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
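
For reference, a hedged sketch (not from the thread) of the usual behaviour, assuming Spark 1.x on YARN: in client mode the driver runs in the submitting JVM while YARN places a small ApplicationMaster on a node of its choosing, so driver and AM are indeed separate processes; in cluster mode the driver runs inside the AM.

import org.apache.spark.{SparkConf, SparkContext}

// yarn-client mode: this JVM is the driver; YARN decides which NodeManager
// hosts the AM, whose job is just to negotiate executor containers for us.
val conf = new SparkConf().setAppName("am-demo").setMaster("yarn-client")
val sc = new SparkContext(conf)

// yarn-cluster mode is normally chosen at submit time instead
// (spark-submit --master yarn-cluster ...); there the driver itself runs
// inside the AM container that YARN places.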


Re: Create a n x n graph given only the vertices no

2016-01-20 Thread praveen S
Hi Robin,

I am using Spark 1.3 and I am not able to find the api
Graph.fromEdgeTuples(edge RDD, 1)

Regards,
Praveen
Well, you can use a similar technique to generate an RDD[(Long, Long)] (that’s
what the edges variable is) and then create the Graph using
Graph.fromEdgeTuples.
---
Robin East
*Spark GraphX in Action* Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 11 Jan 2016, at 12:30, praveen S <mylogi...@gmail.com> wrote:

Yes, I was looking for something of that sort. Thank you.

Actually I was looking for a way to connect nodes based on the properties of
the nodes: I have a set of nodes and I know the condition under which I can
create an edge.
On 11 Jan 2016 14:06, "Robin East" <robin.e...@xense.co.uk> wrote:

> Do you mean to create a complete graph? You can do it like this:
>
> scala> import org.apache.spark.graphx._
> import org.apache.spark.graphx._
>
> scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L))
> vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at
> parallelize at :36
>
> scala> val edges = vert.cartesian(vert)
> edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at
> cartesian at :38
>
> scala> val g = Graph.fromEdgeTuples(edges, 1)
>
>
> That will give you a graph where every vertex connects to every other
> vertex. It will give you self-loops as well, i.e. vertex 1 has an edge to
> vertex 1; you can deal with this using a filter on the edges RDD if you don’t
> want self-loops.
>
> --
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 11 Jan 2016, at 03:19, praveen S <mylogi...@gmail.com> wrote:
>
> Is it possible in graphx to create/generate graph of n x n given only the
> vertices.
> On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote:
>
>> Is it possible in graphx to create/generate a graph n x n given n
>> vertices?
>>
>
>
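
A small follow-up sketch (not from the thread) of the two points above: dropping self-loops, and keeping only edges that satisfy a condition on the vertex properties. The properties and predicate are made up:

import org.apache.spark.graphx._

// Vertices carrying a property (here just an Int, purely illustrative).
val verts = sc.parallelize(Seq((1L, 10), (2L, 25), (3L, 7), (4L, 31), (5L, 18)))

// All candidate pairs, minus self-loops, filtered by a made-up condition
// on the two vertex properties.
val edgeTuples = verts.cartesian(verts)
  .filter { case ((srcId, _), (dstId, _)) => srcId != dstId }   // no self-loops
  .filter { case ((_, a), (_, b)) => math.abs(a - b) < 15 }     // hypothetical condition
  .map    { case ((srcId, _), (dstId, _)) => (srcId, dstId) }

val g = Graph.fromEdgeTuples(edgeTuples, 1)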


Re: Create a n x n graph given only the vertices no

2016-01-20 Thread praveen S
Sorry, found the API.
On 21 Jan 2016 10:17, "praveen S" <mylogi...@gmail.com> wrote:

> Hi Robin,
>
> I am using Spark 1.3 and I am not able to find the api
> Graph.fromEdgeTuples(edge RDD, 1)
>
> Regards,
> Praveen
> Well, you can use a similar technique to generate an RDD[(Long, Long)] (that’s
> what the edges variable is) and then create the Graph using
> Graph.fromEdgeTuples.
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 11 Jan 2016, at 12:30, praveen S <mylogi...@gmail.com> wrote:
>
> Yes, I was looking for something of that sort. Thank you.
>
> Actually I was looking for a way to connect nodes based on the properties of
> the nodes: I have a set of nodes and I know the condition under which I can
> create an edge.
> On 11 Jan 2016 14:06, "Robin East" <robin.e...@xense.co.uk> wrote:
>
>> Do you mean to create a complete graph? You can do it like this:
>>
>> scala> import org.apache.spark.graphx._
>> import org.apache.spark.graphx._
>>
>> scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L))
>> vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at
>> parallelize at :36
>>
>> scala> val edges = vert.cartesian(vert)
>> edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at
>> cartesian at :38
>>
>> scala> val g = Graph.fromEdgeTuples(edges, 1)
>>
>>
>> That will give you a graph where every vertex connects to every other
>> vertex. It will give you self-loops as well, i.e. vertex 1 has an edge to
>> vertex 1; you can deal with this using a filter on the edges RDD if you don’t
>> want self-loops.
>>
>> ------
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 11 Jan 2016, at 03:19, praveen S <mylogi...@gmail.com> wrote:
>>
>> Is it possible in graphx to create/generate graph of n x n given only the
>> vertices.
>> On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote:
>>
>>> Is it possible in graphx to create/generate a graph n x n given n
>>> vertices?
>>>
>>
>>
>


Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread praveen S
Can you give me more details on Spark's JobServer?

Regards,
Praveen
On 18 Jan 2016 03:30, "Jia"  wrote:

> I guess all jobs submitted through JobServer are executed in the same JVM,
> so RDDs cached by one job can be visible to all other jobs executed later.
> On Jan 17, 2016, at 3:56 PM, Mark Hamstra  wrote:
>
> Yes, that is one of the basic reasons to use a
> jobserver/shared-SparkContext.  Otherwise, in order to share the data in an
> RDD you have to use an external storage system, such as a distributed
> filesystem or Tachyon.
>
> On Sun, Jan 17, 2016 at 1:52 PM, Jia  wrote:
>
>> Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem,
>> so that jobs can be submitted at different time and still share RDDs.
>>
>> Best Regards,
>> Jia
>>
>>
>> On Jan 17, 2016, at 3:44 PM, Mark Hamstra 
>> wrote:
>>
>> There is a 1-to-1 relationship between Spark Applications and
>> SparkContexts -- fundamentally, a Spark Application is a program that
>> creates and uses a SparkContext, and that SparkContext is destroyed when
>> the Application ends.  A jobserver generically, and the Spark JobServer
>> specifically, is an Application that keeps a SparkContext open for a long
>> time and allows many Jobs to be submitted and run using that shared
>> SparkContext.
>>
>> More than one Application/SparkContext unavoidably implies more than one
>> JVM process per Worker -- Applications/SparkContexts cannot share JVM
>> processes.
>>
>> On Sun, Jan 17, 2016 at 1:15 PM, Jia  wrote:
>>
>>> Hi, Mark, sorry for the confusion.
>>>
>>> Let me clarify: when an application is submitted, the master tells each
>>> Spark worker to spawn an executor JVM process. All the task sets of the
>>> application are executed by that executor, and after the application runs
>>> to completion the executor process is killed.
>>> But I hope that all submitted applications can run in the same executor;
>>> can JobServer do that? If so, it’s really good news!
>>>
>>> Best Regards,
>>> Jia
>>>
>>> On Jan 17, 2016, at 3:09 PM, Mark Hamstra 
>>> wrote:
>>>
>>> You've still got me confused.  The SparkContext exists at the Driver,
>>> not on an Executor.
>>>
>>> Many Jobs can be run by a SparkContext -- it is a common pattern to use
>>> something like the Spark Jobserver where all Jobs are run through a shared
>>> SparkContext.
>>>
>>> On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou 
>>> wrote:
>>>
 Hi, Mark, sorry, I meant SparkContext.
 I would like to change Spark so that it runs all submitted jobs
 (SparkContexts) in one executor JVM.

 Best Regards,
 Jia

 On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra 
 wrote:

> -dev
>
> What do you mean by JobContext?  That is a Hadoop mapreduce concept,
> not Spark.
>
> On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou 
> wrote:
>
>> Dear all,
>>
>> Is there a way to reuse executor JVM across different JobContexts?
>> Thanks.
>>
>> Best Regards,
>> Jia
>>
>
>

>>>
>>>
>>
>>
>
>
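
Spark JobServer itself is a separate open-source project; the underlying pattern Mark describes is simply one long-lived SparkContext serving many jobs. A rough sketch of that pattern (not the JobServer API), assuming a single server process:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// One long-lived context owned by the single server process.
val sc = new SparkContext(
  new SparkConf().setAppName("shared-context-demo").setMaster("local[2]"))

// Data cached once stays in the executors for as long as this context lives...
val shared: RDD[Int] = sc.parallelize(1 to 1000000).cache()

// ...so "jobs" submitted later (plain method calls here; REST requests in the
// real JobServer) reuse the same cached RDD instead of reloading it.
def jobA(): Long   = shared.count()
def jobB(): Double = shared.map(_.toDouble).sum() / shared.count()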


Usage of SparkContext within a Web container

2016-01-13 Thread praveen S
Is using a SparkContext from a web container the right way to process Spark
jobs, or should we invoke spark-submit through a ProcessBuilder?

Are there any pros or cons of using a SparkContext from a web container?

How does Zeppelin trigger Spark jobs from the web context?
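
For what it's worth, a middle ground between embedding a SparkContext in the container and shelling out via ProcessBuilder is org.apache.spark.launcher.SparkLauncher (Spark 1.4+), which wraps spark-submit programmatically. As far as I know, Zeppelin keeps its long-lived SparkContext in a separate interpreter process rather than in the web-server JVM itself. A minimal sketch; the paths and class name are hypothetical:

import org.apache.spark.launcher.SparkLauncher

// Launches a separate driver JVM via spark-submit, so the web container's
// own JVM never hosts a SparkContext.
val job: Process = new SparkLauncher()
  .setSparkHome("/opt/spark")                          // hypothetical path
  .setAppResource("/opt/jobs/my-analytics-job.jar")    // hypothetical jar
  .setMainClass("com.example.AnalyticsJob")            // hypothetical class
  .setMaster("yarn-client")
  .setConf("spark.executor.memory", "2g")
  .launch()

job.waitFor()   // java.lang.Process; block until the job finishes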


Re: Create a n x n graph given only the vertices no

2016-01-11 Thread praveen S
Yes, I was looking for something of that sort. Thank you.

Actually I was looking for a way to connect nodes based on the properties of
the nodes: I have a set of nodes and I know the condition under which I can
create an edge.
On 11 Jan 2016 14:06, "Robin East" <robin.e...@xense.co.uk> wrote:

> Do you mean to create a complete graph? You can do it like this:
>
> scala> import org.apache.spark.graphx._
> import org.apache.spark.graphx._
>
> scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L))
> vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at
> parallelize at :36
>
> scala> val edges = vert.cartesian(vert)
> edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at
> cartesian at :38
>
> scala> val g = Graph.fromEdgeTuples(edges, 1)
>
>
> That will give you a graph where every vertex connects to every other
> vertex. It will give you self-loops as well, i.e. vertex 1 has an edge to
> vertex 1; you can deal with this using a filter on the edges RDD if you don’t
> want self-loops.
>
> --
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 11 Jan 2016, at 03:19, praveen S <mylogi...@gmail.com> wrote:
>
> Is it possible in graphx to create/generate graph of n x n given only the
> vertices.
> On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote:
>
>> Is it possible in graphx to create/generate a graph n x n given n
>> vertices?
>>
>
>


Re: Create a n x n graph given only the vertices no

2016-01-10 Thread praveen S
Is it possible in GraphX to create/generate an n x n graph given only the
vertices?
On 8 Jan 2016 23:57, "praveen S" <mylogi...@gmail.com> wrote:

> Is it possible in graphx to create/generate a graph n x n given n
> vertices?
>


Create a n x n graph given only the vertices

2016-01-08 Thread praveen S
Is it possible in GraphX to create/generate an n x n graph given n vertices?


Regarding rdd.collect()

2015-08-18 Thread praveen S
When I do an rdd.collect(), does the data move back to the driver, or is it
still held in memory across the executors?
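
For illustration, a short sketch (not from the thread) of what collect() does, assuming an existing SparkContext sc:

// collect() runs a job and ships every element of the RDD back to the driver
// as a local Array; after it returns, that Array lives only in driver memory.
val rdd = sc.parallelize(1 to 100000)
val onDriver: Array[Int] = rdd.collect()

// Executors only keep the partitions around if the RDD is explicitly persisted:
rdd.cache()
rdd.count()   // materializes and caches the partitions in executor memory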


Meaning of local[2]

2015-08-17 Thread praveen S
What does local[2] mean in .setMaster("local[2]")?

Is this applicable only to standalone mode?

Can I do this in a cluster setup, e.g.
.setMaster("hostname:port[2]")?

Is it the number of threads per worker node?
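
For illustration (not from the thread): local[N] is a local-mode-only master URL meaning N worker threads inside the single driver JVM; it is not a per-worker thread count and cannot be appended to a cluster master URL.

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: driver and "executors" all run in this one JVM, on 2 threads.
val localSc = new SparkContext(
  new SparkConf().setAppName("local-demo").setMaster("local[2]"))

// A cluster setup points at a master URL instead; the [N] suffix does not
// apply there. For example, for a standalone cluster (placeholder host/port):
//   new SparkConf().setMaster("spark://hostname:7077")
// Parallelism is then governed by settings such as spark.cores.max and
// spark.executor.cores rather than a thread suffix.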


StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread praveen S
Is StringIndexer + VectorAssembler equivalent to HashingTF when converting
a document for analysis?
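
For what it's worth, they are not equivalent in general: HashingTF turns a sequence of tokens into a fixed-length term-frequency vector, whereas StringIndexer maps one categorical column to a numeric index and VectorAssembler merely concatenates columns. A rough spark.ml sketch (assuming Spark 1.4+, an existing sqlContext, and a made-up DataFrame):

import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer, VectorAssembler}

val df = sqlContext.createDataFrame(Seq(
  (0L, "spark streaming with kafka", "tech"),
  (1L, "random forests in mllib",    "ml")
)).toDF("id", "text", "category")

// HashingTF: free text -> tokens -> counts hashed into a fixed-size vector.
val tokens = new Tokenizer().setInputCol("text").setOutputCol("words").transform(df)
val tf = new HashingTF().setInputCol("words").setOutputCol("tfFeatures")
  .setNumFeatures(1000).transform(tokens)

// StringIndexer + VectorAssembler: categorical column -> index -> assembled vector.
val indexed = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
  .fit(df).transform(df)
val assembled = new VectorAssembler().setInputCols(Array("categoryIdx"))
  .setOutputCol("catFeatures").transform(indexed)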


Re: Spark MLlib v/s SparkR

2015-08-06 Thread praveen S
I am starting off with classification models (logistic regression, random
forests); basically I want to learn machine learning.
Since I have a Java background I started off with MLlib, but later heard that
R works as well (though with scaling issues).

So I was wondering whether SparkR would resolve that scaling issue, hence my
question: why not go with R and SparkR alone (keeping aside my inclination
towards Java)?

On Thu, Aug 6, 2015 at 12:28 AM, Charles Earl charles.ce...@gmail.com
wrote:

 What machine learning algorithms are you interested in exploring or using?
 Start from there, or better yet from the problem you are trying to solve, and
 then the selection may be evident.


 On Wednesday, August 5, 2015, praveen S mylogi...@gmail.com wrote:

 I was wondering when one should go for MLlib or SparkR. What are the
 criteria, or what should be considered, before choosing either of the
 solutions for data analysis?
 Or what are the advantages of Spark MLlib over SparkR, or of SparkR over
 MLlib?



 --
 - Charles



Spark MLlib v/s SparkR

2015-08-05 Thread praveen S
I was wondering when one should go for MLlib or SparkR. What are the
criteria, or what should be considered, before choosing either of the
solutions for data analysis?
Or what are the advantages of Spark MLlib over SparkR, or of SparkR over
MLlib?


Difference between RandomForestModel and RandomForestClassificationModel

2015-07-29 Thread praveen S
Hi,
I wanted to know: what is the difference between RandomForestModel and
RandomForestClassificationModel in MLlib? Will they yield the same results
for a given dataset?
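
For reference, the two classes come from Spark's two APIs: RandomForestModel from the RDD-based org.apache.spark.mllib.tree API, and RandomForestClassificationModel from the DataFrame-based org.apache.spark.ml API. Both train random forests, but the results are not guaranteed to be identical for a given dataset (defaults and random seeds can differ). A rough sketch, assuming Spark 1.5+, an existing sc/sqlContext, and a tiny made-up dataset:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest                    // RDD-based API
import org.apache.spark.ml.classification.RandomForestClassifier   // DataFrame-based API
import org.apache.spark.ml.feature.StringIndexer

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

// mllib: training returns a RandomForestModel.
val mllibModel = RandomForest.trainClassifier(
  data, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 10, featureSubsetStrategy = "auto",
  impurity = "gini", maxDepth = 4, maxBins = 32)

// ml: training on a DataFrame returns a RandomForestClassificationModel.
// The label is indexed first so the classifier can read the class count
// from the column metadata.
val df = sqlContext.createDataFrame(data)   // columns "label", "features"
val indexed = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
  .fit(df).transform(df)
val mlModel = new RandomForestClassifier()
  .setLabelCol("indexedLabel").setNumTrees(10).fit(indexed)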


SparkR and Spark MLlib

2015-07-03 Thread praveen S
Hi,
Are SparkR and Spark MLlib the same thing?