Re: [openstack-dev] [nova] Configure overcommit policy

2013-11-13 Thread Alexander Kuznetsov
Toan and Alex, having separate compute pools for Hadoop is not suitable if
we want to use the unused capacity of an OpenStack cluster to run Hadoop
analytic jobs. In this case it is probably better to modify the over-commit
calculation in the scheduler, per John's suggestion.
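
For illustration, the flavor side of John's idea could look like this (the
"no_overcommit" key and the flavor name are hypothetical; the scheduler
change that would honor the key does not exist yet):

   nova flavor-key hadoop.xlarge set no_overcommit=true

A modified core filter would then treat a host as full at a 1.0 allocation
ratio whenever an instance's flavor carries this key.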


On Tue, Nov 12, 2013 at 7:16 PM, Khanh-Toan Tran <
khanh-toan.t...@cloudwatt.com> wrote:

> FYI, by default OpenStack overcommits CPU at a 1:16 ratio, meaning a host
> can allocate 16 vCPUs for each physical core it possesses. As Alex
> mentioned, you can change it by enabling AggregateCoreFilter in nova.conf:
>    scheduler_default_filters = <your current filter list, adding AggregateCoreFilter here>
>
> and modifying the overcommit ratio by adding:
>   cpu_allocation_ratio=1.0
>
> Just a suggestion: think of isolating hosts for the tenant that runs
> Hadoop, so that they will not serve other applications. You have several
> filters at your disposal (see the sketch after the list):
>  AggregateInstanceExtraSpecsFilter
>  IsolatedHostsFilter
>  AggregateMultiTenancyIsolation
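>
> For instance, a minimal sketch of the AggregateMultiTenancyIsolation
> approach (the names are placeholders; assumes the filter is enabled in
> scheduler_default_filters and matches on the filter_tenant_id metadata
> key):
>
>   nova aggregate-create hadoop-only
>   nova aggregate-add-host hadoop-only compute-07
>   nova aggregate-set-metadata hadoop-only filter_tenant_id=<hadoop-tenant-id>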
>
> Best regards,
>
> Toan
>
> --
> From: "Alex Glikson"
>
> To: "OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev@lists.openstack.org>
> Sent: Tuesday, November 12, 2013 3:54:02 PM
>
> Subject: Re: [openstack-dev] [nova] Configure overcommit policy
>
> You can consider having a separate host aggregate for Hadoop, and use a
> combination of AggregateInstanceExtraSpecsFilter (with a special flavor
> mapped to this host aggregate) and AggregateCoreFilter (overriding
> cpu_allocation_ratio for this host aggregate to be 1).
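>
> A minimal sketch of that setup (the names are placeholders; assumes both
> filters are enabled in scheduler_default_filters):
>
>   # aggregate whose hosts will run Hadoop with no CPU overcommit
>   nova aggregate-create hadoop-aggr
>   nova aggregate-add-host hadoop-aggr compute-01
>   nova aggregate-set-metadata hadoop-aggr hadoop=true cpu_allocation_ratio=1.0
>
>   # special flavor mapped to the aggregate through its extra spec
>   nova flavor-create hadoop.large auto 8192 80 4
>   nova flavor-key hadoop.large set aggregate_instance_extra_specs:hadoop=true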
>
> Regards,
> Alex
>
>
>
>
> From: John Garbutt
> To: "OpenStack Development Mailing List (not for usage questions)"
> ,
> Date: 12/11/2013 04:41 PM
> Subject: Re: [openstack-dev] [nova] Configure overcommit policy
> --
>
>
>
> On 11 November 2013 12:04, Alexander Kuznetsov 
> wrote:
> > Hi all,
> >
> > While studying Hadoop performance in a virtual environment, I found an
> > interesting problem with Nova scheduling. In an OpenStack cluster, we
> > have an overcommit policy that allows placing more VMs on one compute
> > node than there are resources available for them. While this might be
> > suitable for general types of workload, it is definitely not the case
> > for Hadoop clusters, which usually consume 100% of system resources.
> >
> > Is there any way to tell Nova to schedule specific instances (the ones
> > which consume 100% of system resources) without overcommitting resources
> > on the compute node?
>
> You could have a flavor with a "no-overcommit" extra spec and modify
> the over-commit calculation in the scheduler in that case, but I don't
> remember seeing support for that in there.
>
> John
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Configure overcommit policy

2013-11-11 Thread Alexander Kuznetsov
Hi all,


While studying Hadoop performance in a virtual environment, I found an
interesting problem with Nova scheduling. In an OpenStack cluster, we have
an overcommit policy that allows placing more VMs on one compute node than
there are resources available for them. While this might be suitable for
general types of workload, it is definitely not the case for Hadoop
clusters, which usually consume 100% of system resources.

Is there any way to tell Nova to schedule specific instances (the ones
which consume 100% of system resources) without overcommitting resources on
the compute node?


Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] [savanna] [trove] Place for software configuration

2013-11-01 Thread Alexander Kuznetsov
On Fri, Nov 1, 2013 at 12:39 AM, Clint Byrum  wrote:

> Excerpts from Alexander Kuznetsov's message of 2013-10-31 10:51:54 -0700:
> > Hi Heat, Savanna and Trove teams,
> >
> > All these projects have a common part related to software configuration
> > management. To create an environment, a user should specify hardware
> > parameters for VMs: choose a flavor, decide whether to use Cinder or
> > not, configure networks for the virtual machines, and choose a topology
> > for the whole deployment. The next step is linking the software
> > parameters with the hardware specification. From the end user's point of
> > view, the existence of three different places and three different ways
> > (Heat HOT DSL, Trove clustering API and Savanna Hadoop templates) for
> > software configuration is not convenient, especially if the user wants
> > to create an environment simultaneously involving components from
> > Savanna, Heat and Trove.
> >
>
> I'm having a hard time extracting the problem statement. I _think_ that
> the problem is:
>
> As a user I want to tune my software for my available hardware.
>
> So what you're saying is, if you select a flavor that has 4GB of RAM
> for your application, you want to also tell your application that it
> can use 3GB of RAM for an in-memory cache. Likewise, if one has asked
> Trove for an 8GB flavor, they will want to tell it to use 6.5GB of RAM
> for InnoDB buffer cache.
>
> What you'd like to see is one general pattern to express these types
> of things?
>
Exactly.

>
> > I can suggest two approaches to overcome this situation:
> >
> > A common library in Oslo. This approach allows deep domain-specific
> > customization. The user will still have three places with the same UI
> > where configuration actions should be performed.
> >
> > Heat or some other component for software configuration management. This
> > approach is the best for end users. In the future there may be some
> > limitations on deep domain-specific customization for configuration
> > management.
>
> Can you maybe be more concrete with your proposed solutions? The lack
> of a clear problem statement combined with these vague solutions has
> thoroughly confused me.
>
>
Sure. I suggest creating a library or component for the standardization of
software and hardware configuration. It will contain validation logic and
parameter lists.

Today Trove, Savanna and Heat all have a part related to hardware
configuration. For the end user, the VM description should not depend on the
component where it will be used.

Here is an example of a VM description that could be common to Savanna and
Trove:

{
    flavor_id: 42,
    image_id: "test",
    volumes: [{
        # "extra" contains domain-specific parameters.
        # For instance, "aim" for Savanna could be
        # hdfs-dir or mapreduce-dir; for Trove,
        # journal-dir or db-dir.
        extra: {
            aim: hdfs-dir
        },
        size: 10GB,
        filesystem: ext3
    }, {
        extra: {
            aim: mapreduce-dir
        },
        size: 5GB,
        filesystem: ext3
    }],
    networks: [{
        private-network: some-private-net-id,
        public-network: some-public-net-id
    }]
}


Also, it would be great if this library or component standardized some
software configuration parameters, like credentials for a database or LDAP.
This would greatly simplify integration between different components. For
example, if a user wants to process data from Cassandra on Hadoop, the user
should provide the database location and credentials to Hadoop. If we have a
common standard for both Trove and Savanna, this can be done the same way in
both components. An example for Cassandra could look like this:


{
    type: cassandra,
    host: example.com,
    port: 1234,
    credentials: {
        user: "test",
        password: "123"
    }
}


These parameter names and this schema should be the same for all components
referencing a Cassandra server.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] [savanna] [trove] Place for software configuration

2013-11-01 Thread Alexander Kuznetsov
Jay, do you have a plan to add Savanna (type: Heat::Savanna) and Trove
(type: Heat::Trove) providers to the HOT DSL?


On Thu, Oct 31, 2013 at 10:33 PM, Jay Pipes  wrote:

> On 10/31/2013 01:51 PM, Alexander Kuznetsov wrote:
>
>> Hi Heat, Savanna and Trove teams,
>>
>> All these projects have a common part related to software configuration
>> management. To create an environment, a user should specify hardware
>> parameters for VMs: choose a flavor, decide whether to use Cinder or not,
>> configure networks for the virtual machines, and choose a topology for
>> the whole deployment. The next step is linking the software parameters
>> with the hardware specification. From the end user's point of view, the
>> existence of three different places and three different ways (Heat HOT
>> DSL, Trove clustering API and Savanna Hadoop templates) for software
>> configuration is not convenient, especially if the user wants to create
>> an environment simultaneously involving components from Savanna, Heat and
>> Trove.
>>
>> I can suggest two approaches to overcome this situation:
>>
>> A common library in Oslo. This approach allows deep domain-specific
>> customization. The user will still have three places with the same UI
>> where configuration actions should be performed.
>>
>> Heat or some other component for software configuration management. This
>> approach is the best for end users. In the future there may be some
>> limitations on deep domain-specific customization for configuration
>> management.
>>
>
> I think this would be my preference.
>
> In other words, describe and orchestrate a Hadoop or Database setup using
> HOT templates and using Heat as the orchestration engine.
>
> Best,
> -jay
>
>> Heat, Savanna and Trove teams, can you comment on these ideas? Which
>> approach is the best?
>>
>> Alexander Kuznetsov.
>>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [heat] [savanna] [trove] Place for software configuration

2013-10-31 Thread Alexander Kuznetsov
Hi Heat, Savanna and Trove teams,

All these projects have a common part related to software configuration
management. To create an environment, a user should specify hardware
parameters for VMs: choose a flavor, decide whether to use Cinder or not,
configure networks for the virtual machines, and choose a topology for the
whole deployment. The next step is linking the software parameters with the
hardware specification. From the end user's point of view, the existence of
three different places and three different ways (Heat HOT DSL, Trove
clustering API and Savanna Hadoop templates) for software configuration is
not convenient, especially if the user wants to create an environment
simultaneously involving components from Savanna, Heat and Trove.

I can suggest two approaches to overcome this situation:

A common library in Oslo. This approach allows deep domain-specific
customization. The user will still have three places with the same UI where
configuration actions should be performed.

Heat or some other component for software configuration management. This
approach is the best for end users. In the future there may be some
limitations on deep domain-specific customization for configuration
management.

Heat, Savanna and Trove teams, can you comment on these ideas? Which
approach is the best?

Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [savanna] Program name and Mission statement

2013-09-16 Thread Alexander Kuznetsov
Another variant: *Big Data Processing*. This name reflects the nature of
Savanna more precisely than just Data Processing. Also, this name is less
confusing than just Data Processing.


On Tue, Sep 17, 2013 at 12:11 AM, Mike Spreitzer wrote:

> "data processing" is surely a superset of "big data".  Either, by itself,
> is way too vague.  But the wording that many people favor, which I will
> quote again, uses the vague term in a qualified way that makes it
> appropriately specific, IMHO.  Here is the wording again:
>
> ``To provide a simple, reliable and repeatable mechanism by which to
> deploy Hadoop and related Big Data projects, including management,
> monitoring and processing mechanisms driving further adoption of
> OpenStack.''
>
> I think that saying "related Big Data projects" after "Hadoop" is fairly
> clear.  OTOH, I would not mind replacing "Hadoop and related Big Data
> projects" with "the Hadoop ecosystem".
>
> Regards,
> Mike
>
> Matthew Farrellee  wrote on 09/16/2013 02:39:20 PM:
>
> > From: Matthew Farrellee 
> > To: OpenStack Development Mailing List <
> openstack-dev@lists.openstack.org>,
> > Date: 09/16/2013 02:40 PM
> > Subject: Re: [openstack-dev] [savanna] Program name and Mission statement
> >
> > IMHO, Big Data is even more nebulous and currently being pulled in many
> > directions. Hadoop-as-a-Service may be too narrow. So, something in
> > between, such as Data Processing, is a good balance.
> >
> > Best,
> >
> >
> > matt
> >
> > On 09/13/2013 08:37 AM, Abhishek Lahiri wrote:
> > > IMHO data processing is too broad, it makes more sense to clarify this
> > > program as big data as a service or simply
> openstack-Hadoop-as-a-service.
> > >
> > > Thanks & Regards
> > > Abhishek Lahiri
> > >
> > > On Sep 12, 2013, at 9:13 PM, Nirmal Ranganathan <rnir...@gmail.com> wrote:
> > >
> > >>
> > >>
> > >>
> > >> On Wed, Sep 11, 2013 at 8:39 AM, Erik Bergenholtz
> > >> <ebergenho...@hortonworks.com> wrote:
> > >>
> > >>
> > >> On Sep 10, 2013, at 8:50 PM, Jon Maron <jma...@hortonworks.com> wrote:
> > >>
> > >>> Openstack Big Data Platform
> > >>>
> > >>>
> > >>> On Sep 10, 2013, at 8:39 PM, David Scott
> > >>> <david.sc...@cloudscaling.com> wrote:
> > >>>
> > >>>> I vote for 'Open Stack Data'
> > >>>>
> > >>>>
> > >>>> On Tue, Sep 10, 2013 at 5:30 PM, Zhongyue Luo
> > >>>> <zhongyue@intel.com> wrote:
> > >>>>
> > >>>> Why not "OpenStack MapReduce"? I think that pretty much says
> > >>>> it all?
> > >>>>
> > >>>>
> > >>>> On Wed, Sep 11, 2013 at 3:54 AM, Glen Campbell
> > >>>> <g...@glenc.io> wrote:
> > >>>>
> > >>>> "performant" isn't a word. Or, if it is, it means
> > >>>> "having performance." I think you mean
> "high-performance."
> > >>>>
> > >>>>
> > >>>> On Tue, Sep 10, 2013 at 8:47 AM, Matthew Farrellee
> > >>>> <m...@redhat.com> wrote:
> > >>>>
> > >>>> Rough cut -
> > >>>>
> > >>>> Program: OpenStack Data Processing
> > >>>> Mission: To provide the OpenStack community with an
> > >>>> open, cutting edge, performant and scalable data
> > >>>> processing stack and associated management
> interfaces.
> > >>>>
> > >>
> > >> Proposing a slightly different mission:
> > >>
> > >> To provide a simple, reliable and repeatable mechanism by which to
> > >> deploy Hadoop and related Big Data projects, including management,
> > >> monitoring and processing mechanisms driving further adoption of
> > >>

Re: [openstack-dev] TC Meeting / Savanna Incubation Follow-Up

2013-09-13 Thread Alexander Kuznetsov
The Hadoop ecosystem is not only datastore technologies. Hadoop has other
components: the MapReduce framework, a distributed coordinator (ZooKeeper),
workflow management (Oozie), runtimes for scripting languages (Hive and
Pig), and a scalable machine learning library (Apache Mahout). All these
components are tightly coupled, and the datastore part can't be considered
separately from the other components. This is the main reason why Hadoop
installation and management require a separate solution, distinct from a
generic enough™ datastore API. Otherwise, this API will contain a huge part
that is not related to datastore technologies.


On Fri, Sep 13, 2013 at 8:17 PM, Michael Basnight wrote:

>
> On Sep 13, 2013, at 9:05 AM, Alexander Kuznetsov wrote:
>
> >
> >
> >
> > On Fri, Sep 13, 2013 at 7:26 PM, Michael Basnight 
> wrote:
> > On Sep 13, 2013, at 6:56 AM, Alexander Kuznetsov wrote:
> > > On Thu, Sep 12, 2013 at 7:30 PM, Michael Basnight 
> wrote:
> > > On Sep 12, 2013, at 2:39 AM, Thierry Carrez wrote:
> > >
> > > > Sergey Lukjanov wrote:
> > > >
> > > >> [...]
> > > >> As you can see, resources provisioning is just one of the features
> and the implementation details are not critical for overall architecture.
> It performs only the first step of the cluster setup. We’ve been
> considering Heat for a while, but ended up direct API calls in favor of
> speed and simplicity. Going forward Heat integration will be done by
> implementing extension mechanism [3] and [4] as part of Icehouse release.
> > > >>
> > > >> The next part, Hadoop cluster configuration, already extensible and
> we have several plugins - Vanilla, Hortonworks Data Platform and Cloudera
> plugin started too. This allow to unify management of different Hadoop
> distributions under single control plane. The plugins are responsible for
> correct Hadoop ecosystem configuration at already provisioned resources and
> use different Hadoop management tools like Ambari to setup and configure
> all cluster  services, so, there are no actual provisioning configs on
> Savanna side in this case. Savanna and its plugins encapsulate the
> knowledge of Hadoop internals and default configuration for Hadoop services.
> > > >
> > > > My main gripe with Savanna is that it combines (in its upcoming
> release)
> > > > what sounds like to me two very different services: Hadoop cluster
> > > > provisioning service (like what Trove does for databases) and a
> > > > MapReduce+ data API service (like what Marconi does for queues).
> > > >
> > > > Making it part of the same project (rather than two separate
> projects,
> > > > potentially sharing the same program) make discussions about shifting
> > > > some of its clustering ability to another library/project more
> complex
> > > > than they should be (see below).
> > > >
> > > > Could you explain the benefit of having them within the same service,
> > > > rather than two services with one consuming the other ?
> > >
> > > And for the record, i dont think that Trove is the perfect fit for it
> today. We are still working on a clustering API. But when we create it, i
> would love the Savanna team's input, so we can try to make a pluggable API
> thats usable for people who want MySQL or Cassandra or even Hadoop. Im less
> a fan of a clustering library, because in the end, we will both have API
> calls like POST /clusters, GET /clusters, and there will be API duplication
> between the projects.
> > >
> > > I think that a Cluster API (if it were created) would be helpful not
> > > only for Trove and Savanna. NoSQL databases, RDBMSs and Hadoop are not
> > > the only software that can be clustered. What about different kinds of
> > > messaging solutions like RabbitMQ and ActiveMQ, or J2EE containers like
> > > JBoss, WebLogic and WebSphere, which are often installed in clustered
> > > mode? Messaging, databases, J2EE containers and Hadoop have their own
> > > management cycles. It would be confusing to make a Cluster API part of
> > > Trove, which has a different mission: database management and
> > > provisioning.
> >
> > Are you suggesting a 3rd program, cluster as a service? Trove is trying
> to target a generic enough™ API to tackle different technologies with
> plugins or some sort of extensions. This will include a scheduler to
> determine rack awareness. Even if we decide that both Savanna and Trove
> need their own API for building clusters, I still want to understand what
> makes the Savanna API and implementation different, and how Trove can build
> an API/system that can encompass multipl

Re: [openstack-dev] TC Meeting / Savanna Incubation Follow-Up

2013-09-13 Thread Alexander Kuznetsov
On Fri, Sep 13, 2013 at 7:26 PM, Michael Basnight wrote:

> On Sep 13, 2013, at 6:56 AM, Alexander Kuznetsov wrote:
> > On Thu, Sep 12, 2013 at 7:30 PM, Michael Basnight 
> wrote:
> > On Sep 12, 2013, at 2:39 AM, Thierry Carrez wrote:
> >
> > > Sergey Lukjanov wrote:
> > >
> > >> [...]
> > >> As you can see, resources provisioning is just one of the features
> and the implementation details are not critical for overall architecture.
> It performs only the first step of the cluster setup. We’ve been
> considering Heat for a while, but ended up direct API calls in favor of
> speed and simplicity. Going forward Heat integration will be done by
> implementing extension mechanism [3] and [4] as part of Icehouse release.
> > >>
> > >> The next part, Hadoop cluster configuration, already extensible and
> we have several plugins - Vanilla, Hortonworks Data Platform and Cloudera
> plugin started too. This allow to unify management of different Hadoop
> distributions under single control plane. The plugins are responsible for
> correct Hadoop ecosystem configuration at already provisioned resources and
> use different Hadoop management tools like Ambari to setup and configure
> all cluster  services, so, there are no actual provisioning configs on
> Savanna side in this case. Savanna and its plugins encapsulate the
> knowledge of Hadoop internals and default configuration for Hadoop services.
> > >
> > > My main gripe with Savanna is that it combines (in its upcoming
> release)
> > > what sounds like to me two very different services: Hadoop cluster
> > > provisioning service (like what Trove does for databases) and a
> > > MapReduce+ data API service (like what Marconi does for queues).
> > >
> > > Making it part of the same project (rather than two separate projects,
> > > potentially sharing the same program) make discussions about shifting
> > > some of its clustering ability to another library/project more complex
> > > than they should be (see below).
> > >
> > > Could you explain the benefit of having them within the same service,
> > > rather than two services with one consuming the other ?
> >
> > And for the record, i dont think that Trove is the perfect fit for it
> today. We are still working on a clustering API. But when we create it, i
> would love the Savanna team's input, so we can try to make a pluggable API
> thats usable for people who want MySQL or Cassandra or even Hadoop. Im less
> a fan of a clustering library, because in the end, we will both have API
> calls like POST /clusters, GET /clusters, and there will be API duplication
> between the projects.
> >
> > I think that a Cluster API (if it were created) would be helpful not
> > only for Trove and Savanna. NoSQL databases, RDBMSs and Hadoop are not
> > the only software that can be clustered. What about different kinds of
> > messaging solutions like RabbitMQ and ActiveMQ, or J2EE containers like
> > JBoss, WebLogic and WebSphere, which are often installed in clustered
> > mode? Messaging, databases, J2EE containers and Hadoop have their own
> > management cycles. It would be confusing to make a Cluster API part of
> > Trove, which has a different mission: database management and
> > provisioning.
>
> Are you suggesting a 3rd program, cluster as a service? Trove is trying to
> target a generic enough™ API to tackle different technologies with plugins
> or some sort of extensions. This will include a scheduler to determine rack
> awareness. Even if we decide that both Savanna and Trove need their own API
> for building clusters, I still want to understand what makes the Savanna
> API and implementation different, and how Trove can build an API/system
> that can encompass multiple datastore technologies. So regardless of how
> this shakes out, I would urge you to go to the Trove clustering summit
> session [1] so we can share ideas.
>
A generic enough™ API shouldn't contain database-specific calls like backup
and restore (already in Trove). Why would we need backup and restore
operations for J2EE or messaging solutions?

> [1] http://summit.openstack.org/cfp/details/54
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] TC Meeting / Savanna Incubation Follow-Up

2013-09-13 Thread Alexander Kuznetsov
On Thu, Sep 12, 2013 at 7:30 PM, Michael Basnight wrote:

> On Sep 12, 2013, at 2:39 AM, Thierry Carrez wrote:
>
> > Sergey Lukjanov wrote:
> >
> >> [...]
> >> As you can see, resources provisioning is just one of the features and
> the implementation details are not critical for overall architecture. It
> performs only the first step of the cluster setup. We’ve been considering
> Heat for a while, but ended up direct API calls in favor of speed and
> simplicity. Going forward Heat integration will be done by implementing
> extension mechanism [3] and [4] as part of Icehouse release.
> >>
> >> The next part, Hadoop cluster configuration, already extensible and we
> have several plugins - Vanilla, Hortonworks Data Platform and Cloudera
> plugin started too. This allow to unify management of different Hadoop
> distributions under single control plane. The plugins are responsible for
> correct Hadoop ecosystem configuration at already provisioned resources and
> use different Hadoop management tools like Ambari to setup and configure
> all cluster  services, so, there are no actual provisioning configs on
> Savanna side in this case. Savanna and its plugins encapsulate the
> knowledge of Hadoop internals and default configuration for Hadoop services.
> >
> > My main gripe with Savanna is that it combines (in its upcoming release)
> > what sounds like to me two very different services: Hadoop cluster
> > provisioning service (like what Trove does for databases) and a
> > MapReduce+ data API service (like what Marconi does for queues).
> >
> > Making it part of the same project (rather than two separate projects,
> > potentially sharing the same program) make discussions about shifting
> > some of its clustering ability to another library/project more complex
> > than they should be (see below).
> >
> > Could you explain the benefit of having them within the same service,
> > rather than two services with one consuming the other ?
>
> And for the record, i dont think that Trove is the perfect fit for it
> today. We are still working on a clustering API. But when we create it, i
> would love the Savanna team's input, so we can try to make a pluggable API
> thats usable for people who want MySQL or Cassandra or even Hadoop. Im less
> a fan of a clustering library, because in the end, we will both have API
> calls like POST /clusters, GET /clusters, and there will be API duplication
> between the projects.
>
I think that a Cluster API (if it were created) would be helpful not only
for Trove and Savanna. NoSQL databases, RDBMSs and Hadoop are not the only
software that can be clustered. What about different kinds of messaging
solutions like RabbitMQ and ActiveMQ, or J2EE containers like JBoss,
WebLogic and WebSphere, which are often installed in clustered mode?
Messaging, databases, J2EE containers and Hadoop have their own management
cycles. It would be confusing to make a Cluster API part of Trove, which
has a different mission: database management and provisioning.
>
> >
> >> The next topic is “Cluster API”.
> >>
> >> The concern that was raised is how to extract general clustering
> functionality to the common library. Cluster provisioning and management
> topic currently relevant for a number of projects within OpenStack
> ecosystem: Savanna, Trove, TripleO, Heat, Taskflow.
> >>
> >> Still each of the projects has their own understanding of what the
> cluster provisioning is. The idea of extracting common functionality sounds
> reasonable, but details still need to be worked out.
> >>
> >> I’ll try to highlight Savanna team current perspective on this
> question. Notion of “Cluster management” in my perspective has several
> levels:
> >> 1. Resources provisioning and configuration (like instances, networks,
> storages). Heat is the main tool with possibly additional support from
> underlying services. For example, instance grouping API extension [5] in
> Nova would be very useful.
> >> 2. Distributed communication/task execution. There is a project in
> OpenStack ecosystem with the mission to provide a framework for distributed
> task execution - TaskFlow [6]. It’s been started quite recently. In Savanna
> we are really looking forward to use more and more of its functionality in
> I and J cycles as TaskFlow itself getting more mature.
> >> 3. Higher level clustering - management of the actual services working
> on top of the infrastructure. For example, in Savanna configuring HDFS data
> nodes or in Trove setting up MySQL cluster with Percona or Galera. This
> operations are typical very specific for the project domain. As for Savanna
> specifically, we use lots of benefits of Hadoop internals knowledge to
> deploy and configure it properly.
> >>
> >> Overall conclusion it seems to be that it make sense to enhance Heat
> capabilities and invest in Taskflow development, leaving domain-specific
> operations to the individual projects.
> >
> > The thing we'd need to clarify (and the incubation period would be used
> > to achieve

Re: [openstack-dev] [nova] [savanna] Host information for non admin users

2013-09-13 Thread Alexander Kuznetsov
Thanks for your comments; let me explain a bit more about Hadoop topology.

In Hadoop 1.2, 4-level topologies were introduced: the whole network, rack,
node group (representing Hadoop nodes on the same compute host in the
simplest case) and node. Usually Hadoop has a replication factor of 3. In
this case the Hadoop placement algorithm tries to put an HDFS block on the
local node or in the local node group; the second replica should be placed
outside the node group but on the same rack, and the last replica outside
the initial rack. The topology is defined by the path to the VM, e.g.:

/datacenter1/rack1/host1/vm1
/datacenter1/rack1/host1/vm2
/datacenter1/rack1/host2/vm1
/datacenter1/rack1/host2/vm2
/datacenter1/rack2/host3/vm1
/datacenter1/rack2/host3/vm2


Also, this information will be used for job routing, to place the mappers as
close as possible to the data.


The main idea is to provide this information to Hadoop. Usually there is a
direct mapping between the physical data center structure and Hadoop node
placement, but in the case of a public data center some abstract names will
be fine, as long as the configuration reflects proximity information for
the Hadoop nodes.
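
For illustration, Hadoop can resolve such paths through a topology script
(configured via the topology.script.file.name property in Hadoop 1.x). A
minimal sketch, assuming a hypothetical two-column "address path" map file:

   #!/bin/bash
   # Print the topology path for each address Hadoop passes in;
   # fall back to /default-rack for unknown hosts.
   MAP_FILE=/etc/hadoop/topology.map   # e.g. "10.0.0.5 /datacenter1/rack1/host1"
   while [ $# -gt 0 ]; do
       path=$(awk -v host="$1" '$1 == host { print $2 }' "$MAP_FILE")
       echo "${path:-/default-rack}"
       shift
   done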


Mike, as I understand it, a holistic scheduler can provide the needed
information. Can you give more details about it?


On Fri, Sep 13, 2013 at 11:54 AM, John Garbutt  wrote:

> Exposing the detailed info in private cloud, sure makes sense. For
> public clouds, not so sure. Would be nice to find something that works
> for both.
>
> We let the user express their intent through the instance groups api.
> The scheduler will then do a best effort to meet that criteria, using
> its private information. At a courser grain, we have availability
> zones, that you could use to express "closeness", and probably often
> give you a good measure of closeness anyway.
>
> So a Hadoop user could request a several small groups of VMs defined
> in instance groups to be close, and maybe spread across different
> availability zones.
>
> Would that do the trick? Or does Hadoop/HDFS need a bit more
> granularity than that? Could it look to auto-detect "closeness" in
> some auto-setup phase, given rough user hints?
>
> John
>
> On 13 September 2013 07:40, Alex Glikson  wrote:
> > If I understand correctly, what really matters at least in case of
> Hadoop is
> > network proximity between instances.
> > Hence, maybe Neutron would be a better fit to provide such information.
> In
> > particular, depending on virtual network configuration, having 2
> instances
> > on the same node does not guarantee that the network traffic between them
> > will be routed within the node.
> > Physical layout could be useful for availability-related purposes. But
> even
> > then, it should be abstracted in such a way that it will not reveal
> details
> > that a cloud provider will typically prefer not to expose. Maybe this
> can be
> > done by Ironic -- or a separate/new project (Tuskar sounds related).
> >
> > Regards,
> > Alex
> >
> >
> >
> >
> > From: Mike Spreitzer
> > To: OpenStack Development Mailing List
> > ,
> > Date: 13/09/2013 08:54 AM
> > Subject: Re: [openstack-dev] [nova] [savanna] Host information for
> > non admin users
> >
> >
> >
> >> From: Nirmal Ranganathan 
> >> ...
> >> Well that's left upto the specific block placement policies in hdfs,
> >> all we are providing with the topology information is a hint on
> >> node/rack placement.
> >
> > Oh, you are looking at the placement of HDFS blocks within the fixed
> storage
> > volumes, not choosing where to put the storage volumes.  In that case I
> > understand and agree that simply providing identifiers from the
> > infrastructure to the middleware (HDFS) will suffice.  Coincidentally my
> > group is working on this very example right now in our own environment.
>  We
> > have a holistic scheduler that is given a whole template to place, and it
> > returns placement information.  We imagine, as does Hadoop, a general
> > hierarchy in the physical layout, and the holistic scheduler returns, for
> > each VM, the path from the root to the VM's host.
> >
> > Regards,
> >
> > Mike
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] [savanna] Host information for non admin users

2013-09-12 Thread Alexander Kuznetsov
Hi folks,

Currently Nova doesn't provide information about the host of a virtual
machine to non-admin users. Is it possible to change this situation? This
information is needed in the Hadoop deployment case: Hadoop is now aware of
the virtual environment, and this knowledge helps Hadoop achieve better
performance and robustness.
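
For reference, the host currently surfaces only through the extended server
attributes, which are admin-only by default; e.g. (the server name is a
placeholder):

   nova show hadoop-vm-1 | grep OS-EXT-SRV-ATTR:host

For a regular tenant user the same field is simply absent from the response.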

Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [savanna] Program name and Mission statement

2013-09-10 Thread Alexander Kuznetsov
My suggestion: OpenStack Data Processing.


On Tue, Sep 10, 2013 at 4:15 PM, Sergey Lukjanov wrote:

> Hi folks,
>
> due to the Incubator Application we should prepare Program name and
> Mission statement for Savanna, so, I want to start mailing thread about it.
>
> Please, provide any ideas here.
>
> P.S. List of existing programs: https://wiki.openstack.org/wiki/Programs
> P.P.S. https://wiki.openstack.org/wiki/Governance/NewPrograms
>
> Sincerely yours,
> Sergey Lukjanov
> Savanna Technical Lead
> Mirantis Inc.
>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Savanna-all] Savanna EDP sequence diagrams added for discussion...

2013-07-22 Thread Alexander Kuznetsov
I updated the REST API draft:
https://etherpad.openstack.org/savanna_API_draft_EDP_extensions. New methods
related to the job source and data discovery components were added; the job
object was also updated.


On Fri, Jul 19, 2013 at 12:26 AM, Trevor McKay  wrote:

> fyi, updates to the diagram based on feedback
>
> On Thu, 2013-07-18 at 13:49 -0400, Trevor McKay wrote:
> > Hi all,
> >
> >   Here is a page to hold sequence diagrams for Savanna EDP,
> > based on current launchpad blueprints.  We thought it might be helpful to
> > create some diagrams for discussion as the component specs are written
> and the
> > API is worked out:
> >
> >   https://wiki.openstack.org/wiki/Savanna/EDP_Sequences
> >
> >   (The main page for EDP is here
> https://wiki.openstack.org/wiki/Savanna/EDP )
> >
> >   There is an initial sequence there, along with a link to the source
> > for generating the PNG with PlantUML.  Feedback would be great, either
> > through IRC, email, comments on the wiki, or by modifying
> > the sequence and/or posting additional sequences.
> >
> >   The sequences can be generated/modified easily with with Plantuml which
> > installs as a single jar file:
> >
> >   http://plantuml.sourceforge.net/download.html
> >
> >   java -jar plantuml.jar
> >
> >   Choose the directory which contains plantuml text files and it will
> > monitor, generate, and update PNGs as you save/modify text files. I
> thought
> > it was broken the first time I ran it because there are no controls :)
> > Very simple.
> >
> > Best,
> >
> > Trevor
> >
> >
>
>
>
> --
> Mailing list: https://launchpad.net/~savanna-all
> Post to : savanna-...@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~savanna-all
> More help   : https://help.launchpad.net/ListHelp
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] savanna version 0.3 - added UI mockups for EDP workflow

2013-07-16 Thread Alexander Kuznetsov
Chad,

I suggest the following improvement to the Job Parameters tab. You can find
the new variant on the same page under your variant.

Alexander Kuznetsov.


On Tue, Jul 16, 2013 at 1:59 PM, Ruslan Kamaldinov  wrote:

> Chad,
>
> I'd like to see more details about job progress on the "Job list view". It
> should display current progress, logs, errors.
>
> For Hive, Pig and Oozie flows it would be useful to list all the jobs from
> the task. Something similar to https://github.com/twitter/ambrose would
> be great (without fancy graphs).
>
>
> Ruslan
>
>
> On Fri, Jul 12, 2013 at 7:14 PM, Chad Roberts  wrote:
>
>> I have added some initial UI mockups for version 0.3.
>> Any comments are appreciated.
>>
>> https://wiki.openstack.org/wiki/Savanna/UIMockups/JobCreation
>>
>> Thanks,
>> Chad
>>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] savanna version 0.3 - added UI mockups for EDP workflow

2013-07-12 Thread Alexander Kuznetsov
On the parameters tab, we see the case for the Hadoop streaming API. Could
you please add more examples for the parameters tab, including cases for
Hadoop jar files, Pig and Hive scripts?

Thanks,
Alexander Kuznetsov.


On Fri, Jul 12, 2013 at 7:14 PM, Chad Roberts  wrote:

> I have added some initial UI mockups for version 0.3.
> Any comments are appreciated.
>
> https://wiki.openstack.org/wiki/Savanna/UIMockups/JobCreation
>
> Thanks,
> Chad
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Savanna-all] Blueprints for EDP components

2013-07-11 Thread Alexander Kuznetsov
Hi,

Blueprints for the EDP components have been added on Launchpad:

https://blueprints.launchpad.net/savanna/+spec/job-manager-components
https://blueprints.launchpad.net/savanna/+spec/data-discovery-component
https://blueprints.launchpad.net/savanna/+spec/job-source-component
https://blueprints.launchpad.net/savanna/+spec/methods-for-plugin-api-to-support-edp

Each blueprint contains a short component description, the object model and
the methods that will be implemented in that component.

Your comments and suggestions are welcome.

Thanks,
Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] Savanna version 0.3 - on demand Hadoop task execution

2013-07-02 Thread Alexander Kuznetsov
We want to initiate a discussion about the Elastic Data Processing (EDP)
Savanna component. This functionality is planned to be implemented in the
next development phase, starting on July 15. The main questions to address
are:
   - what kind of functionality should be implemented for EDP?
   - what are the main components and their responsibilities?
   - which existing tools, like Hue or Oozie, should be used?


To have something to start with, we have prepared an overview of our
thoughts in the following document:
https://wiki.openstack.org/wiki/Savanna/EDP. For your convenience, you can
find the text below. Your comments and suggestions are welcome.

Key Features

Starting the job:

   - Simple REST API and UI
   - TODO: mockups
   - Job can be entered through UI/API or pulled through VCS
   - Configurable data source



Job execution modes:

   - Run job on one of the existing clusters
     - Expose information on cluster load
     - Provide hints for optimizing data locality (TODO: more details)
   - Create a new transient cluster for the job


Job structure:

   - Individual job via jar file, Pig or Hive script
   - Oozie workflow
     - In the future, support import of EMR job flows


Job execution tracking and monitoring:

   - Any existing components that can help to visualize? (Twitter Ambrose
     <https://github.com/twitter/ambrose>)
   - Terminate job
   - Auto-scaling functionality


Main EDP Components

Data Discovery Component

EDP can have several sources of data for processing. Data can be pulled from
Swift, GlusterFS or a NoSQL database like Cassandra or HBase. To provide
unified access to this data, we will introduce a component responsible for
discovering the data location and providing the right configuration for the
Hadoop cluster. It should have a pluggable design.
Job Source

Users would like to execute different types of jobs: jar files, Pig and Hive
scripts, Oozie job flows, etc. The job description and source code can be
supplied in different ways. Some users just want to paste a Hive script and
run it. Other users want to save the script in Savanna's internal database
for later use. We also need to provide the ability to run a job from source
code stored in a VCS.
Savanna Dispatcher Component

This component is responsible for provisioning a new cluster, scheduling a
job on a new or existing cluster, resizing a cluster, and gathering
information from clusters about current jobs and utilization. It should also
provide information that helps to make the right decision about where to
schedule a job: create a new cluster or use an existing one. For example,
the current load on the clusters, their proximity to the data location, etc.
UI Component

Integration into the OpenStack Dashboard (Horizon). It should provide tools
for job creation, monitoring, etc.

Cloudera Hue already provides part of this functionality: submitting jobs
(jar file, Hive, Pig, Impala) and viewing job status and output.
Cluster Level Coordination Component

Exposes information about jobs on a specific cluster. Possibly this
component can be represented by the existing Hadoop projects Hue and Oozie.
User Workflow

- User selects or creates a job to run

- User chooses a data source of the appropriate type for this job

- Dispatcher provides hints to the user about a better way to schedule this
job (on an existing cluster or by creating a new one)

- User makes a decision based on the hint from the dispatcher

- Dispatcher (if needed) creates a new cluster or resizes an existing one
and schedules the job to it

- Dispatcher periodically pulls the status of the job and shows it in the UI
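
As a rough illustration of this workflow as REST calls (the endpoints,
payload fields and savanna host below are invented for discussion; the
actual API is still being drafted, and auth headers are omitted):

   # register a data source and a job, then ask the dispatcher to run it
   curl -X POST http://savanna/v1.0/<tenant>/data-sources \
        -d '{"type": "swift", "url": "swift://container/input"}'
   curl -X POST http://savanna/v1.0/<tenant>/jobs \
        -d '{"type": "pig", "script": "..."}'
   curl -X POST http://savanna/v1.0/<tenant>/jobs/<job-id>/execute \
        -d '{"cluster_id": "<existing-or-new>"}'

   # poll the job status that the UI would display
   curl http://savanna/v1.0/<tenant>/job-executions/<execution-id>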

Thanks,

Alexander Kuznetsov
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev