Yes, this is a CSV loader. This looks like one of those cases where there are many ways to handle 90% of the requirements but none that solves 100% of the problem. The built-in CSV loader is one of them: it almost solves the problem, but not quite.
We're not running Solr as a web app, only the embedded server, which is why we can't use curl and hence the CSV loader. So this is a purely command-line-driven application that runs against an embedded Solr server, with no web container, for performance reasons.

On Thu, Apr 21, 2011 at 4:47 PM, Yonik Seeley <[email protected]> wrote:

> On Thu, Apr 21, 2011 at 7:27 PM, Kiko Aumond <[email protected]> wrote:
> > Yes, I've seen that page, but I went a bit beyond the material there, as the
> > code I wrote is able to set parameters such as separators, encapsulators and
> > the index columns, whether to split parameters, auto-commit as well as the
> > ability to do incremental or full index reloads.
>
> Is this a CSV loader?
> If so, did you know the CSV loader (and other data loaders) have the
> option to bypass HTTP also and stream directly from a local file (or
> other URL)?
>
> > Also, from what I've seen in DirectSolrConnection (version 1.4.1), you have
> > to supply the document body as a String. We want to avoid having to load
> > the entire document into memory, which is why we load the files into
> > ContentStream objects and pass them to the embedded Solr server (I am
> > assuming ContentStream actually streams the file as its name suggests
> > instead of trying to load it into memory). The utility I wrote gets a path,
> > a Regex expression for all the files to be loaded, as well as the parameters
> > mentioned above and it does either a full or incremental upload of multiple
> > files with a single command.
> >
> > We run a very high load application with SOLR in the back end that requires
> > that we use the Embedded solr server to eliminate the network round-trip.
> > Even a small incremental gain in performance is important for us.
>
> Eliminating the network round-trip is certainly important for good
> bulk indexing performance. Luckily you don't have to
> embed to do that.
> You can use multiple threads (say 16 for a 4 core server) that
> essentially covers up
> any round-trip latency (use persistent connections though! or use
> SolrJ which does by default),
> or you can use the StreamingUpdateSolrServer that eliminates
> round-trip network delays
> by streaming documents over multiple already open connections.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
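
For reference, the file-selection step of the utility described in the thread (a path plus a regex for the files to load) could look roughly like the sketch below. It uses only the JDK. `CsvFileSelector` and `select` are hypothetical names, not Solr classes, and the sketch assumes the regex is matched against the file name only and that subdirectories are not searched. Handing each file to the embedded server is left out, since that part depends on the SolrJ version in use.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: picks the files to load.
class CsvFileSelector {

    // Returns the regular files directly under dir whose file name fully
    // matches the regex, sorted by path so the load order is repeatable.
    static List<Path> select(Path dir, String fileNameRegex) throws IOException {
        Pattern pattern = Pattern.compile(fileNameRegex);
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && pattern.matcher(name).matches()) {
                    matches.add(entry);
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults for illustration: current directory, *.csv files.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String regex = args.length > 1 ? args[1] : ".*\\.csv";
        for (Path file : select(dir, regex)) {
            // In the real utility each path would be wrapped in a stream and
            // passed to the embedded server here, one file at a time.
            System.out.println(file + " (" + Files.size(file) + " bytes)");
        }
    }
}
```

Because only paths are collected, nothing is read until each file is actually opened, which keeps the goal of not holding a whole file in memory as a String.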
