Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
> FWIW, As a consumer I'm not excited about seeing ES make its way back into
> the katello ecosystem. All of my opinion is based on the fact that I use
> pulp inside Katello.

Me neither.

> PostgreSQL's more recent versions extended to NoSQL feature sets that can
> be very performant. Simply googling PostgreSQL NoSQL points to lots of
> articles on it.

Actually, full text search has been available in PostgreSQL for years as a plugin, and it was included in core, I think, somewhere in the 8.x series. I am not sure if this fulfills the NoSQL buzzword, but it works just fine with gigabytes of data (which I tested myself). It integrates with ispell for stemming (a really great feature that Lucene didn't match for years), and configuration is trivial.

Having the search integrated in one database is a huge benefit. Separate indexing components tend to be slow on updates and can fall out of sync with the data. Data can be reindexed, but that does not solve the root cause of the problem. I expect Pulp will be indexing only some parts of the data. I can imagine package names do not need to be full-text indexed at all, since they have their own index already, and with the PostgreSQL integrated solution you can use both (the package name index plus full text for, let's say, errata texts, if I understand your motivation correctly). Also, having all the data under one roof (and in one transaction) can be a really big deal for data integrity, backup and security.

As a (small) Lucene contributor with experience of Lucene, ES and PostgreSQL full text search capabilities, I'd try to evaluate the PostgreSQL option for real. The search API is usually quite easy; the most difficult part is preparing the data, and you will be doing that regardless of the chosen technology stack. Therefore I think the missing Django plugin for PostgreSQL full text search might not be the biggest issue at all.
Google found some links if you want to see a comparison, for example:

http://es.slideshare.net/billkarwin/full-text-search-in-postgresql

PostgreSQL full text search outperforms Lucene by a factor of four in this one and takes less index space on disk. This was just a quick search, but I want to show that Lucene/ES won't be faster than PostgreSQL by an order of magnitude. And RDBMS scaling has not been a *real* issue for decades.

I am really happy you are back in the RDBMS business, folks :-)

--
Later,
  Lukas #lzap Zapletal

___
Pulp-list mailing list
Pulp-list@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-list
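To make the PostgreSQL option discussed above a little more concrete, here is a minimal sketch of what an integrated full text query over errata texts might look like. The `errata` table and its `title` and `description` columns are hypothetical (they are not taken from Pulp's schema, and are assumed NOT NULL), and the `%(name)s` placeholders assume a DB-API driver with the "pyformat" parameter style, such as psycopg2. Only the SQL functions `to_tsvector`, `plainto_tsquery`, `ts_rank` and the `@@` match operator are PostgreSQL's own. The Python part only builds the statement and its parameters, so it runs without a database.

```python
"""Sketch: PostgreSQL full text search over a hypothetical `errata` table."""

# Hypothetical schema: errata(id, title, description), both text columns NOT NULL.
# An expression index lets PostgreSQL answer the query below from the GIN index,
# provided the query uses the same to_tsvector(...) expression.
CREATE_INDEX_SQL = """
CREATE INDEX errata_fts_idx
    ON errata
 USING GIN (to_tsvector('english', title || ' ' || description));
"""

# Parameters use the DB-API "pyformat" style (an assumption about the driver).
SEARCH_SQL = """
SELECT id, title,
       ts_rank(to_tsvector('english', title || ' ' || description),
               plainto_tsquery('english', %(terms)s)) AS rank
  FROM errata
 WHERE to_tsvector('english', title || ' ' || description)
       @@ plainto_tsquery('english', %(terms)s)
 ORDER BY rank DESC
 LIMIT %(limit)s;
"""


def build_errata_search(terms, limit=20):
    """Return (sql, params) suitable for passing to cursor.execute()."""
    if not terms or not terms.strip():
        raise ValueError("search terms must not be empty")
    return SEARCH_SQL, {"terms": terms.strip(), "limit": int(limit)}
```

With a live connection this would be used as `cursor.execute(*build_errata_search("kernel security"))`. Because the index lives in the same database as the rows, it is updated in the same transaction as the data, which is the integrity argument made in the message.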
Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
My personal experience with Django as an ORM is less than stellar, but that was based on an older version of Django that I had to retrofit onto an existing schema. For instance, the ORM insists that every table has an 'id' primary key, even many-to-many tables. While not incorrect, it's annoying and unnecessary in my opinion: the primary key could be a composite of the two foreign keys. Maybe it's easier if you start with a clean schema.

I like SQLAlchemy much better, personally. Maybe the ORM has changed recently.

Mihai's $0.02

On Tue, May 17, 2016 at 1:55 PM, Sean Myers wrote:
> Based on the feedback so far, I haven't seen any issues with what
> I've proposed here other than elasticsearch. I'll be digging into
> that piece of the stack and re-evaluating the options out there,
> taking the feedback from this thread into account.
>
> Thanks!
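The schema shape described above, a many-to-many table whose primary key is the composite of its two foreign keys rather than a surrogate 'id', can be sketched as follows. This is only an illustration: the table and column names are hypothetical, and SQLite (via Python's standard `sqlite3` module) is swapped in for PostgreSQL purely so that the example runs without a server. The DDL pattern itself is standard SQL.

```python
import sqlite3

# Hypothetical tables; the link table has no surrogate 'id' column.
SCHEMA = """
CREATE TABLE repository (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE package    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE repository_package (
    repository_id INTEGER NOT NULL REFERENCES repository(id),
    package_id    INTEGER NOT NULL REFERENCES package(id),
    PRIMARY KEY (repository_id, package_id)  -- composite of the two foreign keys
);
"""


def make_db():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def link(conn, repository_id, package_id):
    """Link a package to a repository.

    Returns False if a constraint rejects the row, for example because the
    same pair is already linked (the composite primary key forbids duplicates).
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO repository_package (repository_id, package_id) "
                "VALUES (?, ?)",
                (repository_id, package_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The composite key doubles as the uniqueness constraint on the pair, so no separate unique index is needed to prevent duplicate links.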
Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
Based on the feedback so far, I haven't seen any issues with what I've proposed here other than elasticsearch. I'll be digging into that piece of the stack and re-evaluating the options out there, taking the feedback from this thread into account.

Thanks!
Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
FWIW, As a consumer I'm not excited about seeing ES make its way back into the katello ecosystem. All of my opinion is based on the fact that I use pulp inside Katello.

PostgreSQL's more recent versions extended to NoSQL feature sets that can be very performant. Simply googling PostgreSQL NoSQL points to lots of articles on it. Having supported several large applications that didn't scale well due to ORMs, I'm not a huge fan of them myself, but I'd rather have an ORM than have ES on my systems.

Since I keep saying I don't want ES but haven't given any reasons, here are my top 3:

1: Bundled everything - we are big on packages (hey pulp!), and having a big bundled package from an open source project just rubs me the wrong way and has other fun issues that y'all will be stuck dealing with.

2: Scaling - Katello's install isn't really designed to easily be built across multiple systems, even if ES is. Not that you can't do it, but breaking things out can be...interesting. Then, ES requires you to think about scale from the get-go. If pulp or katello initialize a default, there is a strong requirement to oversize everything upfront, but even that is dangerous.
   https://www.elastic.co/guide/en/elasticsearch/guide/master/scale.html

3: Data integrity - I've had supposedly recoverable shards that I had to lose completely because they would not come back online, no matter what the documentation tells you.

-greg
Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
Check out the Django Elasticsearch Python package called Django Seeker. It supports Elasticsearch 1.x and 2.x and the querying/faceting you were looking for.

https://github.com/imsweb/django-seeker
http://django-seeker.readthedocs.io/en/latest/

My company maintains it, FYI. Per our Django developers, haystack's ES support seemed under-maintained for our needs.

-----Original Message-----
From: pulp-list-boun...@redhat.com [mailto:pulp-list-boun...@redhat.com] On Behalf Of Sean Myers
Sent: Monday, May 16, 2016 9:18 AM
To: Eric Helms
Cc: pulp-list
Subject: Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications

On 05/12/2016 02:09 PM, Eric Helms wrote:
> Can you expand on why a separate search service is needed and how
> Postgres won't fill your needs?

Unfortunately, Django itself doesn't meet our needs for search when using postgres as the DB. To get it, we've got to go out to a plugin. The plugins available for this are pretty slim pickings, and haystack stands out as the best option among them. Since Django's our interface to the DB, we'd have to bolt on our own search functionality to make the DB do this work.
Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
On 05/12/2016 02:09 PM, Eric Helms wrote:
> Can you expand on why a separate search service is needed and how Postgres
> won't fill your needs?

Unfortunately, Django itself doesn't meet our needs for search when using postgres as the DB. To get it, we've got to go out to a plugin. The plugins available for this are pretty slim pickings, and haystack stands out as the best option among them. Since Django's our interface to the DB, we'd have to bolt on our own search functionality to make the DB do this work.
Re: [Pulp-list] Pulp 3.0 Technology Stack Justifications
On Thu, May 12, 2016 at 10:51 AM, Sean Myers wrote:
> Finally, we're looking at bringing in a search engine for the
> search views in the API. We're currently doing search using
> mongodb, using mongo-specific search criteria, but will be
> decoupling the search API from the search engine. As with Django,
> a few team members have experience using elasticsearch (myself
> included). Elasticsearch is java-based, running on top of the
> Lucene indexer, with a simple REST API on top of it, and so at
> the moment it's my preferred search engine.
>
> I looked at a few other search engines in recent testing, including
> the pure-python engine "Whoosh", Solr (also uses lucene), Xapian,
> and Sphinx (the search engine, not the document builder). Of these,
> only Whoosh and Elasticsearch have first-party support by the
> django-haystack project[2], which is both my preferred and the most
> commonly used django search plugin[3]. Given my previous positive
> experience with Elasticsearch, I think it's probably the best choice
> for a search indexer at this time.

Can you expand on why a separate search service is needed and how Postgres won't fill your needs?

Thanks,
Eric
[Pulp-list] Pulp 3.0 Technology Stack Justifications
Early planning for Pulp 3.0 is building up some steam, and it's a good time to go over the proposed technology stack that we're looking at building on. For all of these choices, once Pulp's basic needs are met, the major deciding factors for what library to use are "meta" factors, like community support, release processes, etc. Special thanks to Jeff Ortel for making sure my assumptions about these tools got challenged so the right choices get made.

We're using postgres as the DB for 3.0. Since we're going relational, the next thing we'd want is a good ORM. Several team members have experience with the Django ORM, and Pulp is actually already using Django in its views. It has a fantastic community, is well documented, and comes with a vast multitude of third-party plugins to help us fill in any gaps in functionality that may be found. Our current tasking system is built on Celery[0], which is among those third-party plugins with excellent Django support, which potentially means that using Django with a relational DB can help us get rid of code where we overlap functionality that may be provided by django-celery.

Other ORM options were considered, but only SQLAlchemy (another very good ORM) stood out as something we could use if there was a compelling reason to switch from Django, and at this time there is no such reason. Django does the job well. Most other ORMs are either not robust enough in their feature set or apparently not being actively maintained, and were rejected as alternatives. Also rejected outright was not using an ORM (or other form of data mapper) at all, since my sense is that we all agree that we don't want to be writing SQL manually. :)

This leads to the next big building block, which is the tool we should use to build our REST APIs. I've used django-tastypie in the past, as have a few other team members, and it was my front-runner for this job. After looking around though, it looks like django-rest-framework (DRF) is currently dominating this space in the Django community[1]. Going through some of their tutorials and examples, it's looking like tastypie is out of the running, and DRF is the winner. Both would be adequate for Pulp's needs when it comes to putting a REST API on top of our data model, so it makes sense to go with the more "popular" option. In addition, I think DRF's documentation and API are easier to work with than tastypie's, so it's simultaneously easier to use and easier to *learn how* to use.

Finally, we're looking at bringing in a search engine for the search views in the API. We're currently doing search using mongodb, using mongo-specific search criteria, but will be decoupling the search API from the search engine. As with Django, a few team members have experience using elasticsearch (myself included). Elasticsearch is java-based, running on top of the Lucene indexer, with a simple REST API on top of it, and so at the moment it's my preferred search engine.

I looked at a few other search engines in recent testing, including the pure-python engine "Whoosh", Solr (also uses lucene), Xapian, and Sphinx (the search engine, not the document builder). Of these, only Whoosh and Elasticsearch have first-party support by the django-haystack project[2], which is both my preferred and the most commonly used django search plugin[3]. Given my previous positive experience with Elasticsearch, I think it's probably the best choice for a search indexer at this time.

The Whoosh plugin for Haystack currently doesn't support a very useful feature that Whoosh itself does support, which is faceting. This feature gap is something that would need to be closed (likely by us) to get feature parity between the elasticsearch and whoosh backends.

While there are other libraries that appear to live in the same space as haystack (integrate a search indexer with Django models, providing Django QuerySet/Model results), none of them have the robust features and community support seen in haystack. Again, though, decoupling the search interface from the search implementation means that this piece is likely to be easy to change out if we find better options in the future (especially if we write it with this in mind).

Summary:
- Django ORM on postgres
- django-rest-framework to build API views
- django-haystack to provide search capabilities, using Elasticsearch
  to start, possibly switching to Whoosh after some development -- this
  switch should occur before any release of 3.0

[0]: http://docs.celeryproject.org/en/latest/django/
[1]: https://www.djangopackages.com/grids/g/rest/
[2]: http://django-haystack.readthedocs.io/en/stable/backend_support.html
[3]: https://www.djangopackages.com/grids/g/search/
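The idea of decoupling the search interface from the search implementation, along with the faceting feature mentioned above, could be sketched roughly as follows. Everything here is hypothetical and for illustration only: `SearchBackend`, `InMemoryBackend` and their method names are not part of Pulp, django-haystack, Elasticsearch or Whoosh, and the in-memory substring match merely stands in for a real engine.

```python
from abc import ABC, abstractmethod
from collections import Counter


class SearchBackend(ABC):
    """Hypothetical engine-neutral interface that API views would depend on."""

    @abstractmethod
    def index(self, doc_id, fields):
        """Add or replace one document, given as a mapping of field to value."""

    @abstractmethod
    def search(self, text):
        """Return the ids of documents matching `text`."""

    @abstractmethod
    def facet(self, text, field):
        """Count matching documents per distinct value of `field`."""


class InMemoryBackend(SearchBackend):
    """Toy stand-in for a real engine: case-insensitive substring matching."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, fields):
        self._docs[doc_id] = dict(fields)

    def search(self, text):
        needle = text.lower()
        return sorted(
            doc_id
            for doc_id, fields in self._docs.items()
            if any(needle in str(value).lower() for value in fields.values())
        )

    def facet(self, text, field):
        hits = self.search(text)
        return Counter(
            self._docs[doc_id][field]
            for doc_id in hits
            if field in self._docs[doc_id]
        )
```

Code written against `SearchBackend` would not need to change if the concrete backend were later swapped, which is the property the post is after. Faceting here means grouping the hits by a field value and counting each group, for example the number of matching units per content type.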