No direct help, but a bunch of related random thoughts:

1) How are you running Tika? As a jar, loading from scratch every time? Tika can
also run in a server mode where it listens on a network socket. You send the
file, it sends the extracted text back. Might be faster.
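[For reference: tika-server listens on port 9998 by default and answers a PUT to /tika with the extracted plain text. A rough sketch of the client side, assuming a server is already running at the default URL (adjust as needed):]

```python
import urllib.request

# Default tika-server endpoint (assumption: started with `java -jar tika-server.jar`)
TIKA_URL = "http://localhost:9998/tika"

def build_tika_request(path, url=TIKA_URL):
    """Build a PUT request asking a running tika-server for a file's plain text."""
    with open(path, "rb") as fh:
        data = fh.read()
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Accept", "text/plain")  # extracted text, not the XHTML default
    return req

# Sending it (requires a live tika-server):
#   text = urllib.request.urlopen(build_tika_request("report.pdf")).read().decode("utf-8")
```

[Because the JVM stays warm between files, this avoids paying Tika's startup cost a million times over.]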
2) Deleting old stuff. You can index into a new core and then swap the cores
out. Heavy on the server, but the client will not notice. Or just reindex into
the same core but store an index-time timestamp, then delete with a query for
the old timestamp (anything not reindexed).

3) DIH is OK, but getting long in the tooth, and you are kind of supposed to
grow out of it. Maybe look at Flume for a more modern take.

4) Security: maybe ManifoldCF has something you can use:
http://projects.apache.org/projects/manifoldcf.html

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once.
Lately, it doesn't seem to be working. (Anonymous - via GTD book)


On Fri, Oct 4, 2013 at 12:29 PM, Sadler, Anthony <anthony.sad...@yrgrp.com> wrote:

> Hi all:
>
> I've had a quick look through the archives but am struggling to find a
> decent search query (a bad start to my Solr career), so apologies if this
> has been asked multiple times before, as I'm sure it has.
>
> We've got several Windows file servers across several locations and we'd
> like to index their contents using Solr. So far we've come up with this
> setup:
>
> - 1 Solr server with several collections, segregated by file security
> needs or line of business.
> - At each remote site, a Linux machine has mounted the relevant local
> fileserver's filesystem via SMB/CIFS.
> - That machine runs a Perl script written by yours truly that creates an
> XML index of all the files and then submits them to Solr for indexing.
> Content of files with certain extensions is extracted using Tika. Happy to
> post this script.
>
> The script is fairly mature and has a few smarts in it, like being able to
> do delta updates (not in the Solr sense of the word: it does a full scan
> of the file system, then writes out a timestamp; the next time it runs it
> only grabs files modified since that timestamp). This works...
> to a point. There are these problems:
>
> ---------------------------------------------------------------------------
>
> Time:
> -----
> On some servers we're dealing with something in the region of a million or
> more files. Indexing that many files takes upwards of 48 hours. While the
> script is now fairly stable and fault tolerant, that is still a pretty long
> time. Part of the reason for the slowness is the content extraction by
> Tika, but I've been unable to find a satisfactory alternative. We could
> drop the whole content thing, but then what's the point? Half the beauty
> of Solr/Tika is that we *can* do it.
>
> Projecting from some averages, it'd take the better part of a week to
> index one of our file servers.
>
> Deletes:
> --------
> As explained above, once the initial scan takes place, all activity
> thereafter is limited to files that have changed since $last_run_time.
> However, this presents a problem: if a file gets deleted from the file
> server, we're still going to see it in the search results. There are a few
> ways that I can see to get rid of these stale files, but they either won't
> work or are evil:
>
> - Re-index the server. Evil because it'll take half a week.
> - Use some filesystem watcher to watch for deletes. Won't work because
> we're using an SMB/CIFS share mount.
> - Periodically list all the files on the fileserver, diff that against all
> the files stored in Solr, and delete the differences from Solr, thereby
> syncing the two. Evil because... well, it just is. I'd be asking Solr for
> every record it has, which'll be a doozy of a return value. Surely there
> has to be a more elegant way?
>
> Security:
> ---------
> We've worked around this by not indexing some files or by separating
> files out into various collections. As such it is not a huge problem, but
> has anyone figured out how to integrate Solr with LDAP?
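[On the Deletes problem: a minimal sketch of the index-timestamp cleanup Alex suggests in point 2, assuming every document gets an `index_ts` date field stamped at index time and each full pass records when it started. Core name and URL are placeholders:]

```python
import urllib.request

# Assumed Solr update handler; adjust host and core name for your setup.
SOLR_UPDATE = "http://localhost:8983/solr/collection1/update"

def stale_docs_delete_payload(pass_started):
    """XML delete-by-query: everything whose index_ts predates this pass.

    The exclusive upper bound {* TO ...} keeps docs stamped at exactly
    pass_started (i.e. touched by the current pass).
    """
    return "<delete><query>index_ts:{* TO %s}</query></delete>" % pass_started

def build_delete_request(pass_started, url=SOLR_UPDATE):
    req = urllib.request.Request(
        url + "?commit=true",
        data=stale_docs_delete_payload(pass_started).encode("utf-8"),
        method="POST")
    req.add_header("Content-Type", "text/xml")
    return req

# urllib.request.urlopen(build_delete_request("2013-10-04T00:00:00Z"))  # needs live Solr
```

[Run the full scan, let every touched document pick up a fresh `index_ts`, then fire this one delete-by-query: deleted files simply never get restamped and fall out, with no diffing required.]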
>
> ---------------------------------------------------------------------------
>
> DIH:
> ----
> Someone will reasonably ask why we're not using the DIH. I tried using it
> but found the following:
>
> - It would crash.
> - When I stopped it crashing by using the on-error options, both in the
> Tika subsection and the main part of the DIH config, it still crashed with
> a Java out-of-memory error.
> - I gave Java more memory, but it still crashed.
>
> At that point I gave up, for the following reasons:
>
> - DIH and I were not getting along.
> - Java and I were not getting along.
> - Java and DIH were not getting along.
> - All the documentation I could find was either really basic or really
> advanced; there was no intermediate stuff as far as I could find.
> - I realised that I could do what I wanted better in Perl than with DIH,
> and this seemed a better solution.
>
> The Perl script has, by and large, been a success. However, we've run up
> against the above problems.
>
> Which now leads me to my ultimate question: surely other people have been
> in this same situation. How did they solve these issues? Is the slow
> indexing time simply a function of the large dataset we want to index? Do
> we need to throw more oomph at the servers?
>
> The more I play with Solr, the more I realise I need to learn, and the
> more I realise I'm way out of my depth, hence this email.
>
> Thanks
>
> Anthony
>
> ________________________________
>
> ==========================================
> Privileged/Confidential Information may be contained in this message. If
> you are not the addressee indicated in this message (or responsible for
> delivery of the message to such person), you may not copy or deliver this
> message to anyone. In such case, you should destroy this message and
> kindly notify the sender by reply email.
> Please advise immediately if you or your employer does not consent to
> email for messages of this kind. Opinions, conclusions and other
> information in this message that do not relate to the official business
> of Burson-Marsteller shall be understood as neither given nor endorsed
> by it.
> ==========================================
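[And for what it's worth, the "evil" diff approach need not pull every record: asking Solr for only the unique key field, paged with start/rows, keeps each response small. A rough sketch, assuming the unique key field is named `id` and the usual select handler (URL and page size are placeholders):]

```python
from urllib.parse import urlencode

# Assumed Solr select handler; adjust host and core name for your setup.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def page_of_ids_url(page, page_size=1000, base=SOLR_SELECT):
    """URL for one page of document ids -- no stored content comes back."""
    params = {
        "q": "*:*",
        "fl": "id",        # return only the unique key field
        "sort": "id asc",  # stable order so pages don't shift between requests
        "wt": "json",
        "start": page * page_size,
        "rows": page_size,
    }
    return base + "?" + urlencode(params)

# Walk the pages, collect the ids into a set, diff it against a set of paths
# from the filesystem scan; ids in Solr but not on disk are the stale docs.
```

[Still heavier than the timestamp trick, but at a million docs it's a stream of short id-only pages rather than "every record it has".]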