Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-18 Thread Lukas Zapletal
> FWIW, As a consumer I'm not excited about seeing ES make its way back into
> the katello ecosystem.  All of my opinion is based on the fact that I use
> pulp inside Katello.

Me neither.

> PostgreSQL's more recent versions extended to NoSQL feature sets that can
> be very performant.  Simply googling PostgreSQL NoSQL points to lots of
> articles on it.

Actually, full text search has been in PostgreSQL for years, first as a
plugin, and it was included in core somewhere in the 8.x series, I think.
I am not sure if this fulfills the NoSQL buzzword, but it's something
that works just fine with gigabytes of data (which I tested myself).
It integrates with ispell for stemming (a really great feature that
Lucene could not match for years) and configuration is trivial.

Having the search integrated in one database is a huge benefit. Separate
indexing components tend to be slow on updates and can get out of sync
with the data. Data can be reindexed, but that does not solve the root
cause of the problem. I expect Pulp will be indexing only some parts of
the data. I can imagine package names do not need to be full text
indexed at all since they have their own index already, and with the
PostgreSQL integrated solution you can use both (package name index plus
full text for, let's say, errata texts, if I understand your motivation
correctly). Also, having all the data under one roof (and one
transaction) can be a really big deal for data integrity, backup and
security.
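
To make the errata case concrete, here is a minimal sketch of a
PostgreSQL full text query prepared from Python. The erratum table and
its id, title and description columns are made up for illustration and
are not Pulp's schema, and the %(name)s placeholders assume a
psycopg2-style driver. to_tsvector, plainto_tsquery, ts_rank and the @@
operator are PostgreSQL built-ins.

```python
# Hypothetical table and column names; not taken from Pulp's schema.
ERRATA_SEARCH_SQL = """
SELECT id, title,
       ts_rank(to_tsvector('english', description),
               plainto_tsquery('english', %(terms)s)) AS rank
FROM erratum
WHERE to_tsvector('english', description)
      @@ plainto_tsquery('english', %(terms)s)
ORDER BY rank DESC
LIMIT %(limit)s
"""


def errata_search(terms, limit=20):
    """Return the SQL text and bound parameters for a full text query.

    The caller is expected to pass both to a DB-API cursor, for example
    cursor.execute(sql, params). Nothing is executed here.
    """
    if not terms.strip():
        raise ValueError("search terms must not be empty")
    return ERRATA_SEARCH_SQL, {"terms": terms, "limit": limit}
```

Because the query runs in the same database as the rest of the data, it
can be joined with ordinary indexed columns and takes part in the same
transaction. In practice the tsvector would normally be kept in a GIN
index rather than computed per row as it is above.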

As a (minor) Lucene contributor with experience in Lucene, ES and
PostgreSQL full text search, I'd try to evaluate the PostgreSQL option
for real. The search API is usually quite easy; the most difficult part
is preparing the data, and you will be doing that regardless of the
chosen technology stack. Therefore I think the missing Django plugin for
PostgreSQL full text search might not be the biggest issue at all.

Google found this link if you want to see a comparison:

http://es.slideshare.net/billkarwin/full-text-search-in-postgresql

PostgreSQL full text outperforms Lucene by a factor of 4 in this one and
takes less index data on disk. This was just a quick search, but I want
to show that Lucene/ES won't be faster than PostgreSQL by an order of
magnitude. And RDBMS scaling has not been a *real* issue for decades.

I am really happy you are back in the RDBMS business, folks :-)

-- 
Later,
 Lukas #lzap Zapletal

___
Pulp-list mailing list
Pulp-list@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-list


Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-17 Thread Mihai Ibanescu
My personal experience with Django as an ORM is less than stellar, but that
was based on an older version of Django that I had to retrofit onto an
existing schema. For instance, the ORM insists that every table has to have
an 'id' primary key, even many-to-many tables. While not incorrect, it's
annoying and unnecessary in my opinion - the primary key would be a
composite of the two foreign keys. Maybe it's easier if you start with a
clean schema. I like SQLAlchemy much better, personally.
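
For illustration, the composite-key layout described above can be
sketched with the standard library's sqlite3 module. The repo, package
and repo_package tables are hypothetical, and this is plain SQL rather
than what any particular ORM generates.

```python
import sqlite3

# In-memory database with a hypothetical many-to-many join table whose
# primary key is the pair of foreign keys, with no surrogate 'id' column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repo (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE repo_package (
    repo_id INTEGER NOT NULL REFERENCES repo(id),
    package_id INTEGER NOT NULL REFERENCES package(id),
    PRIMARY KEY (repo_id, package_id)
);
""")
conn.execute("INSERT INTO repo VALUES (1, 'base')")
conn.execute("INSERT INTO package VALUES (1, 'bash')")


def link(repo_id, package_id):
    """Link a package to a repo; return False if the link already exists."""
    try:
        conn.execute(
            "INSERT INTO repo_package VALUES (?, ?)", (repo_id, package_id)
        )
    except sqlite3.IntegrityError:
        # The composite primary key rejects a duplicate pair.
        return False
    return True
```

The composite key enforces uniqueness of each (repo, package) pair on
its own, which is the reason an extra 'id' column feels redundant here.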

Maybe the ORM has changed recently.

Mihai's $0.02

On Tue, May 17, 2016 at 1:55 PM, Sean Myers wrote:

> Based on the feedback so far, I haven't seen any issues with what
> I've proposed here other than elasticsearch. I'll be digging into
> that piece of the stack and reevaluating the options out there,
> taking the feedback from this thread into account.
>
> Thanks!

Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-17 Thread Sean Myers
Based on the feedback so far, I haven't seen any issues with what
I've proposed here other than elasticsearch. I'll be digging into
that piece of the stack and reevaluating the options out there,
taking the feedback from this thread into account.

Thanks!




Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-17 Thread Greg Swift
FWIW, As a consumer I'm not excited about seeing ES make its way back into
the katello ecosystem.  All of my opinion is based on the fact that I use
pulp inside Katello.

PostgreSQL's more recent versions extended to NoSQL feature sets that can
be very performant.  Simply googling PostgreSQL NoSQL points to lots of
articles on it.

Having supported several large applications that didn't scale well due
to ORMs, I'm not a huge fan of them myself, but I'd rather have an ORM than
have ES on my systems.

Since I keep saying I don't want ES but haven't given any reasons, here are
my top 3:

1: Bundled everything - we are big on packages (hey pulp!) and having a big
bundled package from an open source project just rubs me the wrong way and
has other fun issues that y'all will be stuck dealing with.

2: Scaling - Katello's install isn't really designed to easily be built
across multiple systems, even if ES is.  Not that you can't do it, but
breaking things out can be...interesting.  Then, ES requires you to think
about scale from the get go. If pulp or katello initialize a default, there
is a strong requirement to oversize everything upfront, but even that is
dangerous.
https://www.elastic.co/guide/en/elasticsearch/guide/master/scale.html

3: Data integrity (I've had supposedly recoverable shards that I had to
lose completely because they would not come back online, no matter what the
documentation tells you)

-greg



Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-16 Thread Ashby, Jason (IMS)
Check out the Django Elasticsearch Python package called Django Seeker. It
supports Elasticsearch 1.x and 2.x and the querying/faceting you were
looking for.

https://github.com/imsweb/django-seeker
http://django-seeker.readthedocs.io/en/latest/

My company maintains it FYI. Per our Django developers, haystack ES seemed 
under-maintained for our needs.

-Original Message-
From: pulp-list-boun...@redhat.com [mailto:pulp-list-boun...@redhat.com] On 
Behalf Of Sean Myers
Sent: Monday, May 16, 2016 9:18 AM
To: Eric Helms 
Cc: pulp-list 
Subject: Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

On 05/12/2016 02:09 PM, Eric Helms wrote:
> Can you expand on why a separate search service is needed and how
> Postgres won't fill your needs?

Unfortunately, Django itself doesn't meet our needs for search when using 
postgres as the DB.

To get it, we've got to go out to a plugin. The plugins available for this are 
pretty slim pickings, and haystack stands out as the best option among them. 
Since Django's our interface to the DB, we'd have to bolt on our own search 
functionality to make the DB do this work.






Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-16 Thread Sean Myers
On 05/12/2016 02:09 PM, Eric Helms wrote:
> Can you expand on why a separate search service is needed and how Postgres
> won't fill your needs?

Unfortunately, Django itself doesn't meet our needs for search when using
postgres as the DB.

To get it, we've got to go out to a plugin. The plugins available for this 
are pretty slim pickings, and haystack stands out as the best option among 
them. Since Django's our interface to the DB, we'd have to bolt on our own 
search functionality to make the DB do this work.




Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-12 Thread Eric Helms
On Thu, May 12, 2016 at 10:51 AM, Sean Myers wrote:

> Early planning for Pulp 3.0 is building up some steam, and it's
> a good time to go over the proposed technology stack that we're
> looking at building on right now. For all of these choices, once
> Pulp's basic needs are met, the deciding factors for which library
> to use are "meta" factors, like community support, release
> processes, etc. Special
> thanks to Jeff Ortel for making sure my assumptions about these
> tools got challenged so the right choices get made.
>
> We're using postgres as the DB for 3.0. Since we're going
> relational, the next thing we'd want is a good ORM. Several team
> members have experience with the Django ORM, and Pulp is actually
> already using it in its views. It has a fantastic community, is
> well documented, and comes with a vast multitude of third-party
> plugins to help us fill in any gaps in functionality that may be
> found. Our current tasking system is built on Celery[0], which is
> among those third-party plugins with excellent Django support,
> which potentially means that using Django with a relational DB
> can help us get rid of code where we overlap functionality that
> may be provided by django-celery.
>
> Other ORM options were considered, but only SQLAlchemy (another
> very good ORM) stood out as something we could use if there was
> a compelling reason to switch from Django, but at this time there
> is no such reason. Django does the job well. Most other ORMs are
> either not robust enough in their feature-set or apparently not
> being actively maintained, and were rejected as alternatives.
> Also rejected outright was not using an ORM (or other form of
> data mapper) at all, since my sense is that we all agree that
> we don't want to manually be writing SQL. :)
>
> This leads to the next big building block, which is the tool we
> should use to build our REST APIs. I've used django-tastypie in
> the past, as have a few other team members, and it was my front-
> runner for this job. After looking around though, it looks like
> django-rest-framework (DRF) is currently dominating this space
> in the Django community[1]. Going through some of their tutorials
> and examples, it's looking like tastypie is out of the running,
> and DRF is the winner. Both would be adequate for Pulp's needs
> when it comes to putting a REST API on top of our data model, so
> it makes sense to go with the more "popular" option. In addition,
> I think its documentation and API are easier to work with than
> tastypie's, so it's simultaneously easier to use and easier to
> *learn how* to use.
>
> Finally, we're looking at bringing in a search engine for the
> search views in the API. We're currently doing search using
> mongodb, using mongo-specific search criteria, but will be
> decoupling the search API from the search engine. As with Django,
> a few team members have experience using elasticsearch (myself
> included). Elasticsearch is java-based, running on top of the
> Lucene indexer, with a simple REST API on top of it, and so at
> the moment it's my preferred search engine.
>
> I looked at a few other search engines in recent testing, including
> the pure-python engine "Whoosh", Solr (also uses lucene), Xapian,
> and Sphinx (the search engine, not the document builder). Of these,
> only Whoosh and Elasticsearch have first-party support by the
> django-haystack project[2], which is both my preferred and the most
> commonly used django search plugin[3]. Given my previous positive
> experience with Elasticsearch, I think it's probably the best choice
> for a search indexer at this time.
>

Can you expand on why a separate search service is needed and how Postgres
won't fill your needs?

Thanks,
Eric



[Pulp-list] Pulp 3.0 Technology Stack Justifications

2016-05-12 Thread Sean Myers
Early planning for Pulp 3.0 is building up some steam, and it's
a good time to go over the proposed technology stack that we're
looking at building on right now. For all of these choices, once
Pulp's basic needs are met, the deciding factors for which library
to use are "meta" factors, like community support, release
processes, etc. Special
thanks to Jeff Ortel for making sure my assumptions about these
tools got challenged so the right choices get made.

We're using postgres as the DB for 3.0. Since we're going
relational, the next thing we'd want is a good ORM. Several team
members have experience with the Django ORM, and Pulp is actually
already using it in its views. It has a fantastic community, is
well documented, and comes with a vast multitude of third-party
plugins to help us fill in any gaps in functionality that may be
found. Our current tasking system is built on Celery[0], which is
among those third-party plugins with excellent Django support,
which potentially means that using Django with a relational DB
can help us get rid of code where we overlap functionality that
may be provided by django-celery.

Other ORM options were considered, but only SQLAlchemy (another
very good ORM) stood out as something we could use if there was
a compelling reason to switch from Django, but at this time there
is no such reason. Django does the job well. Most other ORMs are
either not robust enough in their feature-set or apparently not
being actively maintained, and were rejected as alternatives.
Also rejected outright was not using an ORM (or other form of
data mapper) at all, since my sense is that we all agree that
we don't want to manually be writing SQL. :)

This leads to the next big building block, which is the tool we
should use to build our REST APIs. I've used django-tastypie in
the past, as have a few other team members, and it was my front-
runner for this job. After looking around though, it looks like
django-rest-framework (DRF) is currently dominating this space
in the Django community[1]. Going through some of their tutorials
and examples, it's looking like tastypie is out of the running,
and DRF is the winner. Both would be adequate for Pulp's needs
when it comes to putting a REST API on top of our data model, so
it makes sense to go with the more "popular" option. In addition,
I think its documentation and API are easier to work with than
tastypie's, so it's simultaneously easier to use and easier to
*learn how* to use.

Finally, we're looking at bringing in a search engine for the 
search views in the API. We're currently doing search using
mongodb, using mongo-specific search criteria, but will be
decoupling the search API from the search engine. As with Django,
a few team members have experience using elasticsearch (myself
included). Elasticsearch is java-based, running on top of the
Lucene indexer, with a simple REST API on top of it, and so at
the moment it's my preferred search engine.

I looked at a few other search engines in recent testing, including
the pure-python engine "Whoosh", Solr (also uses lucene), Xapian,
and Sphinx (the search engine, not the document builder). Of these,
only Whoosh and Elasticsearch have first-party support by the
django-haystack project[2], which is both my preferred and the most
commonly used django search plugin[3]. Given my previous positive
experience with Elasticsearch, I think it's probably the best choice
for a search indexer at this time.

The Whoosh plugin for Haystack currently doesn't support a very
useful feature that Whoosh itself does support, which is faceting.
This feature gap is something that would need to be closed (likely
by us) to get feature parity between the elasticsearch and whoosh
backends.

While there are other libraries that appear to live in the same space
as haystack (integrate a search indexer with Django models, providing
Django QuerySet/Model results), none of them have the robust features
and community support seen in haystack. Again, though, decoupling the
search interface from the search implementation means that this piece
is likely to be easy to change out if we find better options in the
future (especially if we write it with this in mind).
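
As a rough sketch of what "decoupling the search interface from the
search implementation" could mean in code, the example below defines a
small backend interface and a toy in-memory implementation. All names
here (SearchBackend, InMemoryBackend, find_units) are hypothetical and
are not part of Pulp, Django or haystack; a real backend would wrap
Elasticsearch, Whoosh or the database instead.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface the API views would depend on."""

    @abstractmethod
    def index(self, doc_id, text):
        """Add or replace a document in the index."""

    @abstractmethod
    def search(self, terms):
        """Return ids of documents containing every search term."""


class InMemoryBackend(SearchBackend):
    """Toy implementation: whitespace tokens, case-insensitive match."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = set(text.lower().split())

    def search(self, terms):
        wanted = set(terms.lower().split())
        return sorted(
            doc_id for doc_id, words in self._docs.items() if wanted <= words
        )


def find_units(backend, terms):
    """Caller code only sees the interface, never a concrete engine."""
    return backend.search(terms)
```

Swapping the engine then means writing another SearchBackend subclass
and passing it to find_units, with no change to the calling code.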

Summary:
- Django ORM on postgres
- django-rest-framework to build API views
- django-haystack to provide search capabilities, using Elasticsearch
  to start, possibly switching to Whoosh after some development -- this
  switch should occur before any release of 3.0

[0]: http://docs.celeryproject.org/en/latest/django/
[1]: https://www.djangopackages.com/grids/g/rest/
[2]: http://django-haystack.readthedocs.io/en/stable/backend_support.html
[3]: https://www.djangopackages.com/grids/g/search/


