No direct help, but a bunch of related random thoughts:

1) How are you running Tika? As a jar, loading from scratch every time? Tika can
also run in a server mode where it listens on a network socket. You send the
file, it sends the extracted text back. Might be faster.
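[For reference: tika-server listens on port 9998 by default and answers a PUT to /tika with the extracted plain text. A rough sketch of the client side, assuming a server is already running at the default URL (adjust as needed):]

```python
import urllib.request

# Default tika-server endpoint (assumption: started with `java -jar tika-server.jar`)
TIKA_URL = "http://localhost:9998/tika"

def build_tika_request(path, url=TIKA_URL):
    """Build a PUT request asking a running tika-server for a file's plain text."""
    with open(path, "rb") as fh:
        data = fh.read()
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Accept", "text/plain")  # extracted text, not the XHTML default
    return req

# Sending it (requires a live tika-server):
#   text = urllib.request.urlopen(build_tika_request("report.pdf")).read().decode("utf-8")
```

[Because the JVM stays warm between files, this avoids paying Tika's startup cost a million times over.]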
2) Deleting old stuff. You can index into a new core and then swap the cores
out. Heavy on the server, but the client will not notice. Or just reindex into
the same core but store an index-time timestamp, then delete with a query for
the old timestamp (anything not reindexed).

3) DIH is OK, but getting long in the tooth, and you are kind of supposed to
grow out of it. Maybe look at Flume for a more modern take.

4) Security: maybe ManifoldCF has something you can use:
http://projects.apache.org/projects/manifoldcf.html

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once.
Lately, it doesn't seem to be working. (Anonymous - via GTD book)


On Fri, Oct 4, 2013 at 12:29 PM, Sadler, Anthony <anthony.sad...@yrgrp.com> wrote:

> Hi all:
>
> I've had a quick look through the archives but am struggling to find a
> decent search query (a bad start to my Solr career), so apologies if this
> has been asked multiple times before, as I'm sure it has.
>
> We've got several Windows file servers across several locations and we'd
> like to index their contents using Solr. So far we've come up with this
> setup:
>
> - 1 Solr server with several collections, segregated by file security
> needs or line of business.
> - At each remote site, a Linux machine has mounted the relevant local
> fileserver's filesystem via SMB/CIFS.
> - That machine runs a Perl script written by yours truly that creates an
> XML index of all the files and then submits them to Solr for indexing.
> Content of files with certain extensions is extracted using Tika. Happy to
> post this script.
>
> The script is fairly mature and has a few smarts in it, like being able to
> do delta updates (not in the Solr sense of the word: it does a full scan
> of the file system, then writes out a timestamp; the next time it runs it
> only grabs files modified since that timestamp). This works...
> to a point. There are these problems:
>
> ---------------------------------------------------------------------------
>
> Time:
> -----
> On some servers we're dealing with something in the region of a million or
> more files. Indexing that many files takes upwards of 48 hours. While the
> script is now fairly stable and fault tolerant, that is still a pretty long
> time. Part of the reason for the slowness is the content extraction by
> Tika, but I've been unable to find a satisfactory alternative. We could
> drop the whole content thing, but then what's the point? Half the beauty
> of Solr/Tika is that we *can* do it.
>
> Projecting from some averages, it'd take the better part of a week to
> index one of our file servers.
>
> Deletes:
> --------
> As explained above, once the initial scan takes place, all activity
> thereafter is limited to files that have changed since $last_run_time.
> However, this presents a problem: if a file gets deleted from the file
> server, we're still going to see it in the search results. There are a few
> ways that I can see to get rid of these stale files, but they either won't
> work or are evil:
>
> - Re-index the server. Evil because it'll take half a week.
> - Use some filesystem watcher to watch for deletes. Won't work because
> we're using an SMB/CIFS share mount.
> - Periodically list all the files on the fileserver, diff that against all
> the files stored in Solr, and delete the differences from Solr, thereby
> syncing the two. Evil because... well, it just is. I'd be asking Solr for
> every record it has, which'll be a doozy of a return value. Surely there
> has to be a more elegant way?
>
> Security:
> ---------
> We've worked around this by not indexing some files or by separating
> files out into various collections. As such it is not a huge problem, but
> has anyone figured out how to integrate Solr with LDAP?
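[On the Deletes problem: a minimal sketch of the index-timestamp cleanup Alex suggests in point 2, assuming every document gets an `index_ts` date field stamped at index time and each full pass records when it started. Core name and URL are placeholders:]

```python
import urllib.request

# Assumed Solr update handler; adjust host and core name for your setup.
SOLR_UPDATE = "http://localhost:8983/solr/collection1/update"

def stale_docs_delete_payload(pass_started):
    """XML delete-by-query: everything whose index_ts predates this pass.

    The exclusive upper bound {* TO ...} keeps docs stamped at exactly
    pass_started (i.e. touched by the current pass).
    """
    return "<delete><query>index_ts:{* TO %s}</query></delete>" % pass_started

def build_delete_request(pass_started, url=SOLR_UPDATE):
    req = urllib.request.Request(
        url + "?commit=true",
        data=stale_docs_delete_payload(pass_started).encode("utf-8"),
        method="POST")
    req.add_header("Content-Type", "text/xml")
    return req

# urllib.request.urlopen(build_delete_request("2013-10-04T00:00:00Z"))  # needs live Solr
```

[Run the full scan, let every touched document pick up a fresh `index_ts`, then fire this one delete-by-query: deleted files simply never get restamped and fall out, with no diffing required.]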
>
> ---------------------------------------------------------------------------
>
> DIH:
> ----
> Someone will reasonably ask why we're not using the DIH. I tried using it
> but found the following:
>
> - It would crash.
> - When I stopped it crashing by using the on-error options, both in the
> Tika subsection and the main part of the DIH config, it still crashed with
> a Java out-of-memory error.
> - I gave Java more memory, but it still crashed.
>
> At that point I gave up, for the following reasons:
>
> - DIH and I were not getting along.
> - Java and I were not getting along.
> - Java and DIH were not getting along.
> - All the documentation I could find was either really basic or really
> advanced; there was no intermediate stuff as far as I could find.
> - I realised that I could do what I wanted better in Perl than with DIH,
> and this seemed a better solution.
>
> The Perl script has, by and large, been a success. However, we've run up
> against the above problems.
>
> Which now leads me to my ultimate question: surely other people have been
> in this same situation. How did they solve these issues? Is the slow
> indexing time simply a function of the large dataset we want to index? Do
> we need to throw more oomph at the servers?
>
> The more I play with Solr, the more I realise I need to learn, and the
> more I realise I'm way out of my depth, hence this email.
>
> Thanks
>
> Anthony
>
> ________________________________
>
> ==========================================
> Privileged/Confidential Information may be contained in this message. If
> you are not the addressee indicated in this message (or responsible for
> delivery of the message to such person), you may not copy or deliver this
> message to anyone. In such case, you should destroy this message and
> kindly notify the sender by reply email.
> Please advise immediately if you or your employer does not consent to
> email for messages of this kind. Opinions, conclusions and other
> information in this message that do not relate to the official business
> of Burson-Marsteller shall be understood as neither given nor endorsed
> by it.
> ==========================================
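[And for what it's worth, the "evil" diff approach need not pull every record: asking Solr for only the unique key field, paged with start/rows, keeps each response small. A rough sketch, assuming the unique key field is named `id` and the usual select handler (URL and page size are placeholders):]

```python
from urllib.parse import urlencode

# Assumed Solr select handler; adjust host and core name for your setup.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def page_of_ids_url(page, page_size=1000, base=SOLR_SELECT):
    """URL for one page of document ids -- no stored content comes back."""
    params = {
        "q": "*:*",
        "fl": "id",        # return only the unique key field
        "sort": "id asc",  # stable order so pages don't shift between requests
        "wt": "json",
        "start": page * page_size,
        "rows": page_size,
    }
    return base + "?" + urlencode(params)

# Walk the pages, collect the ids into a set, diff it against a set of paths
# from the filesystem scan; ids in Solr but not on disk are the stale docs.
```

[Still heavier than the timestamp trick, but at a million docs it's a stream of short id-only pages rather than "every record it has".]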