Tom, take a look at https://issues.apache.org/jira/browse/SOLR-6559 and https://issues.apache.org/jira/browse/SOLR-3585; they seem relevant.

On Fri, Dec 12, 2014 at 7:31 PM, Tom Burton-West <tburt...@umich.edu> wrote:

> Thanks everybody for the information.
>
> Shawn, thanks for bringing up the issues around making sure each document
> is indexed ok. With our current architecture, that is important for us.
>
> Yonik's clarification about streaming really helped me to understand one of
> the main advantages of CUSS:
>
> >> When you add a document, it immediately writes it to a stream where
> >> Solr can read it off and index it. When you add a second document,
> >> it's immediately written to the same stream (or at least one of the
> >> open streams), as part of the same update request. No separate HTTP
> >> request, no separate update request.
>
> In our use case, where documents are in the 700K-2MB range, I suspect that
> the overhead of opening/closing new requests is dwarfed by the time it
> takes to just send the data over the wire and parse it. However,
> I'm starting to think about whether I can find some time to do some
> testing.
>
> Mikhail, thanks for suggesting looking at DIH; I haven't looked at it in
> several years and didn't realize there is now functionality to deal with
> XML documents.
>
> When I asked about being able to read XML files from the filesystem, it was
> for the purposes of running some benchmark tests to see if CUSS offers
> enough advantages to re-architect our system.
>
> Currently the main bottleneck in our system is constructing Solr documents.
> We use multiple "document producers" which are responsible both for
> creating a document and for sending it to Solr. Although each producer
> waits until it gets a response from Solr before sending the next document
> to be indexed, we run 20-100 producers, so this is similar to CUSS running
> multiple threads (although of course we open a new HTTP request and Solr
> update request each time).
>
> As far as using DIH or something like it, we might be able to use it for
> testing with already created documents.
>
> Creating the documents requires assembling (and massaging) data from
> several sources including a few database queries, unzipping files on our
> filesystem and concatenating them, and querying another Solr instance which
> has metadata.
>
> I'm now thinking that for testing purposes it might be sufficient to
> construct dummy documents as in the examples rather than trying to use our
> actual documents. If the speed improvements look significant enough, then
> I'd need to figure out how to test with real documents.
>
> Thanks again for all the input.
>
> Tom
>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>