[ https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886701#action_12886701 ]
Karl Wright commented on CONNECTORS-55:
---------------------------------------

>>>>>>
Can you help me out and give me more ideas on what particular performance problems you are concerned about (e.g. query types or whatever)?
<<<<<<

Hi Robert,

There are two major determinants of performance for LCF, under PostgreSQL at any rate.

The first is the performance of the queue stuffer query, and how it scales when the queue is extremely large. This is a complex query, but its basic form is:

SELECT <rowdata> FROM <queuetable> WHERE <some conditions>
  AND NOT EXISTS(<other row-specific conditions in the same table>)
  ORDER BY <priority> ASC LIMIT <typically some hundreds of records>

Because the queue may be very large, and the conditions may potentially match ALL records in the queue before the LIMIT is applied, the query plan MUST wind up reading directly out of the priority index, or the query simply will not work. It cannot afford to read 20 million records into memory and then sort them!

The second place performance can be severely impacted is in how parallel writes can be. In PostgreSQL 7.4, for example, everything was single-threaded on writes. This caused web crawling in particular to perform poorly, because a typical web page has a significant number of links that must be entered in the queue, and single-threading that process cost some 4x to 10x compared with PostgreSQL 8.x, which allowed much more parallelism.

Hope this helps.

> Bundle database server with LCF packaged product
> ------------------------------------------------
>
>                 Key: CONNECTORS-55
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-55
>             Project: Lucene Connector Framework
>          Issue Type: Improvement
>          Components: Framework core
>            Reporter: Jack Krupansky
>
> The current requirement that the user install and deploy a PostgreSQL server
> complicates the installation and deployment of LCF for the user. Installation
> and deployment of LCF should be as simple as Solr itself. QuickStart is great
> for low-end use and basic evaluation, but a comparable level of simplified
> installation and deployment is still needed for full-blown, high-end
> environments that need the full performance of a PostgreSQL-class database
> server. So, PostgreSQL should be bundled with the packaged release of LCF so
> that installation and deployment of LCF will automatically install and deploy
> a subset of the full PostgreSQL distribution that is sufficient for the needs
> of LCF. Starting LCF, with or without the LCF UI, should automatically start
> the database server. Shutting down LCF should also shut down the database
> server process.
> A typical use case would be for a non-developer who is comfortable with Solr
> and simply wants to crawl documents from, for example, a SharePoint
> repository and feed them into Solr. QuickStart should work well for the low
> end or in the early stages of evaluation, but the user would prefer to
> evaluate "the real thing" with something resembling a production crawl of
> thousands of documents. Such a user might not be a hard-core developer or be
> comfortable fiddling with a lot of software components simply to do one
> conceptually simple operation.
> It should still be possible for the user to supply database server settings
> to override the defaults, but the LCF package should have all of the
> best-practice settings deemed appropriate for use with LCF.
> One downside is that installation and deployment will be platform-specific
> since there are multiple processes and PostgreSQL itself requires a
> platform-specific installation.
> This proposal presumes that PostgreSQL is the best option for the foreseeable
> future, but nothing here is intended to preclude support for other database
> servers in future releases.
> This proposal should not have any impact on QuickStart packaging or
> deployment.
> Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.
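
For anyone who wants to experiment with the shape of the queue stuffer query from the comment above, here is a minimal sketch. It is an illustration only, under several assumptions: SQLite (via Python's sqlite3 module) stands in for PostgreSQL so that the example is self-contained, and the table name jobqueue, its columns, the status codes 'P' and 'A', and the two indexes are all hypothetical and not LCF's actual schema. The helper plan_sorts_in_memory only reports whether SQLite's planner says it needs a separate sort step for the ORDER BY; PostgreSQL's planner may well decide differently on the same query.

```python
import sqlite3

# Hypothetical stand-in for the queue stuffer query: pick pending rows whose
# document has no active row elsewhere in the same table, lowest priority first.
QUEUE_QUERY = """
SELECT t0.id, t0.priority
FROM jobqueue t0
WHERE t0.status = 'P'
  AND NOT EXISTS (
      SELECT 1 FROM jobqueue t1
      WHERE t1.dochash = t0.dochash AND t1.status = 'A')
ORDER BY t0.priority ASC
LIMIT ?
"""


def make_rows(n=1000):
    """Generate (id, dochash, status, priority) rows with unique priorities."""
    return [
        (i, f"doc{i % (n // 2)}", "A" if i % 3 == 0 else "P", (i * 7919) % n)
        for i in range(n)
    ]


def build_queue(rows):
    """Create an in-memory queue table with an index on the priority column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobqueue ("
        "id INTEGER PRIMARY KEY, dochash TEXT, status TEXT, priority INTEGER)"
    )
    conn.execute("CREATE INDEX jobqueue_priority ON jobqueue (priority)")
    conn.execute("CREATE INDEX jobqueue_dochash ON jobqueue (dochash)")
    conn.executemany("INSERT INTO jobqueue VALUES (?, ?, ?, ?)", rows)
    return conn


def stuff_queue(conn, limit):
    """Run the query and return up to `limit` (id, priority) pairs."""
    return conn.execute(QUEUE_QUERY, (limit,)).fetchall()


def plan_sorts_in_memory(conn, limit):
    """Return True if SQLite's plan includes a separate sort for ORDER BY."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + QUEUE_QUERY, (limit,)).fetchall()
    # The human-readable plan detail is the last column of each plan row.
    return any("TEMP B-TREE FOR ORDER BY" in row[-1] for row in plan)


if __name__ == "__main__":
    conn = build_queue(make_rows())
    print(stuff_queue(conn, 5))
    print("separate sort step:", plan_sorts_in_memory(conn, 5))
```

If the plan reports a separate sort step, the database is collecting every matching row before ordering them, which is the behaviour the comment says cannot scale to a queue of millions of records. On PostgreSQL the equivalent check would be to read the EXPLAIN output for the real query.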