Re: Archiving documents

2016-09-30 Thread hairymcclarey
 You can also look at sharding options for SolrCloud, e.g. with implicit 
sharding you can choose a sharding field and SolrCloud will index your docs 
into shards based on this field. You could have two shards (and also replicate 
your main shard if you want for distributed searches and fault tolerance) or 
even split your main and archive into several shards depending on size/general 
requirements. You can then very easily search just your main shard(s) by adding 
shards=my_main_shard to the query, or the entire collection by omitting it. I'm 
looking at this for time-series data where I'll have maybe a shard per year, so 
my shard field would be the year. It may make sense to do some "manual" work to 
merge older shards, but I'm not sure on this yet.
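
A rough sketch of the request parameters this implies, assuming the Collections API CREATE action with the implicit router. The collection name "docs", the shard names "main" and "archive", and the field name "tier" are made up for illustration, and other CREATE parameters a real cluster may need are left out.

```python
from urllib.parse import urlencode


def implicit_create_params(name, shard_names, router_field):
    """Parameters for a Collections API CREATE call using the implicit router."""
    return {
        "action": "CREATE",
        "name": name,
        "router.name": "implicit",
        "shards": ",".join(shard_names),  # named shards, e.g. main,archive
        "router.field": router_field,     # each doc's value here names its shard
    }


def shard_search_params(query, shard_names=None):
    """Query parameters; leaving shard_names out searches the whole collection."""
    params = {"q": query}
    if shard_names:
        params["shards"] = ",".join(shard_names)
    return params


# Hypothetical names throughout.
print(urlencode(implicit_create_params("docs", ["main", "archive"], "tier")))
print(urlencode(shard_search_params("title:solr", ["main"])))
```
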
Alternatively, you can use a composite key to be more explicit about whether you 
place your docs in archive or not, using the prefix of the key to denote 
main/archive, and you'd have the same options for searching as above. With this 
you'd need to do some re-indexing as you move stuff in and out of archive. It 
sounds like you'd need something like this because you want to be more in 
control of whether a doc is in archive or not.
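
A minimal sketch of the composite-key idea, assuming the compositeId router's "prefix!id" key format and the _route_ query parameter. The prefixes "main" and "archive" are illustrative. As far as I know, _route_ only narrows which shards are queried and is not a filter, so two prefixes that hash to the same shard would still come back together unless you also filter on something.

```python
def composite_id(prefix, doc_id):
    """Build a compositeId-style key of the form '<prefix>!<id>'."""
    if "!" in prefix:
        raise ValueError("prefix must not contain '!'")
    return f"{prefix}!{doc_id}"


def routed_search_params(query, prefix=None):
    """Query parameters; a prefix limits the search to the shard(s) it routes to."""
    params = {"q": query}
    if prefix:
        params["_route_"] = f"{prefix}!"
    return params


# Moving a doc to the archive changes its key, which is why it has to be
# re-indexed under the new id and removed under the old one.
old_key = composite_id("main", "12345")
new_key = composite_id("archive", "12345")
print(old_key, new_key, routed_search_params("title:solr", "main"))
```
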

 

-Original Message-
From: Vasu Y [mailto:vya...@gmail.com] 
Sent: 29 September 2016 14:55
To: solr-user@lucene.apache.org
Subject: Archiving documents

Hi,
 We would like to archive documents based on some criteria (like those that 
were not modified for more than a year OR are least used) in order to reduce 
storage requirements.
I would like to hear some of the best practices followed.

How about having a main collection and optionally an archive collection (or one 
or more archive collections?) to which we move documents (at regular
intervals) from the main collection based on some criteria (least used or 
modified date etc.) and provide a flag during search whether to include 
archived documents in search or not?

Thanks,
Vasu

   

Re: Archiving documents

2016-09-30 Thread Shawn Heisey
On 9/29/2016 6:55 AM, Vasu Y wrote:
>  We would like to archive documents based on some criteria (like those that
> were not modified for more than a year OR are least used) in order to
> reduce storage requirements.
> I would like to hear some of the best practices followed.
>
> How about having a main collection and optionally an archive collection (or
> one or more archive collections?) to which we move documents (at regular
> intervals) from the main collection based on some criteria (least used or
> modified date etc.) and provide a flag during search whether to include
> archived documents in search or not?

As long as the collections are using compatible schemas and configs, the
general idea here should work.

If this is SolrCloud, you can create a collection alias that can search
multiple collections.
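
A small sketch of what creating such an alias could look like, assuming the Collections API CREATEALIAS action. The base URL, the alias name "all_docs", and the collection names "main" and "archive" are placeholders.

```python
from urllib.parse import urlencode


def create_alias_url(base_url, alias, collections):
    """URL for a Collections API CREATEALIAS call spanning several collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and names. Searches would then go to the alias for
# "main + archive", or straight to the main collection for "main only".
print(create_alias_url("http://localhost:8983/solr", "all_docs", ["main", "archive"]))
```
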

If it's not SolrCloud, you can still do a distributed search using the
"shards" parameter, but it will be slightly more complicated to set up.

If both schemas have a boolean field for the archive flag, with
documents in the main collection having "false" in that field and
documents in the archive collection having "true" in that field, then
you can include a filter for that flag in your search to limit the
search to one collection or the other.  I think that's probably the best
approach.
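
As a sketch of the two query-side pieces above (the "shards" parameter for the non-SolrCloud case and the filter on the archive flag), the parameters might be assembled like this. The field name "archived" and the host/core addresses are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_search_params(query, include_archive, core_urls=None, flag_field="archived"):
    """Query parameters that optionally fan out to listed cores and filter on the flag."""
    params = [("q", query)]
    if core_urls:
        # Non-SolrCloud distributed search: list each core as host:port/path.
        params.append(("shards", ",".join(core_urls)))
    if not include_archive:
        # Keep only documents whose archive flag is false.
        params.append(("fq", f"{flag_field}:false"))
    return params


cores = ["host1:8983/solr/main", "host2:8983/solr/archive"]  # hypothetical
print(urlencode(build_search_params("title:solr", include_archive=False, core_urls=cores)))
```
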

Thanks,
Shawn



Re: Archiving documents

2016-09-29 Thread John Bickerstaff
I'm not the expert, but I'm thinking you would need an external process to
handle this.  Solr itself doesn't seem built to use its own collection
data to act on collection data (I'd love to be wrong about that).

So - barring any corrections from the committers, I'm imagining you'd need
to write some software that queries your collection for documents with the
relevant last_modified_date and then adds them to the "archive" collection,
either using the returned Solr document data (if you stored everything) or
by re-querying the data from the original source based on an id.  Once you
were sure all was well with this process, you could issue a command to
delete all the docs with a last_modified_date older than a certain point
(from the main collection).
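
To make the order of operations concrete, here is a minimal sketch of that copy, verify, then delete flow. Plain dicts stand in for the two collections, so nothing here talks to Solr. The field name last_modified_date comes from the description above, everything else is made up. The stale_query helper shows the kind of Solr date-math range query a real process might send instead.

```python
from datetime import datetime, timedelta, timezone


def stale_query(date_field="last_modified_date", age="1YEAR"):
    """Solr range query for docs last modified more than `age` ago."""
    return f"{date_field}:[* TO NOW-{age}]"


def archive_stale_docs(main, archive, cutoff, date_field="last_modified_date"):
    """Copy docs older than `cutoff` into `archive`, verify, then delete from `main`.

    `main` and `archive` are dicts of id -> doc standing in for collections.
    Returns the ids that were moved.
    """
    stale = [doc for doc in main.values() if doc[date_field] < cutoff]
    for doc in stale:
        archive[doc["id"]] = dict(doc)  # step 1: add to the archive
    missing = [doc["id"] for doc in stale if doc["id"] not in archive]
    if missing:                         # step 2: check before deleting anything
        raise RuntimeError(f"not archived: {missing}")
    for doc in stale:
        del main[doc["id"]]             # step 3: delete from main
    return [doc["id"] for doc in stale]


now = datetime(2016, 9, 30, tzinfo=timezone.utc)
main = {
    "a": {"id": "a", "last_modified_date": now - timedelta(days=400)},
    "b": {"id": "b", "last_modified_date": now - timedelta(days=10)},
}
archive = {}
moved = archive_stale_docs(main, archive, cutoff=now - timedelta(days=365))
print(moved, sorted(main), sorted(archive), stale_query())
```
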

If there's a built-in way to accomplish this - or if others have already
thought this through extensively, I'm certainly interested in hearing about
it.

Good luck!



On Thu, Sep 29, 2016 at 6:55 AM, Vasu Y wrote:

> Hi,
>  We would like to archive documents based on some criteria (like those that
> were not modified for more than a year OR are least used) in order to
> reduce storage requirements.
> I would like to hear some of the best practices followed.
>
> How about having a main collection and optionally an archive collection (or
> one or more archive collections?) to which we move documents (at regular
> intervals) from the main collection based on some criteria (least used or
> modified date etc.) and provide a flag during search whether to include
> archived documents in search or not?
>
> Thanks,
> Vasu
>


Archiving documents

2016-09-29 Thread Vasu Y
Hi,
 We would like to archive documents based on some criteria (like those that
were not modified for more than a year OR are least used) in order to
reduce storage requirements.
I would like to hear some of the best practices followed.

How about having a main collection and optionally an archive collection (or
one or more archive collections?) to which we move documents (at regular
intervals) from the main collection based on some criteria (least used or
modified date etc.) and provide a flag during search whether to include
archived documents in search or not?

Thanks,
Vasu