Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-30 Thread Grant Ingersoll

On Oct 29, 2010, at 6:00 PM, Ron Mayer wrote:

> I have some documents with a bunch of attachments (images, thumbnails
> for them, audio clips, word docs, etc); and am currently dealing with
> them by just putting a path on a filesystem to them in solr; and then
> jumping through hoops of keeping them in sync with solr.
>
> Would it be nuts to stick the image data itself in solr?
>
> More specifically - if I have a bunch of large stored fields,
> would it significantly impact search performance in the
> cases when those fields aren't fetched?

Make sure you have lazy field loading on.
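
In Solr this is the enableLazyFieldLoading setting in the <query> section
of solrconfig.xml. As a rough illustration of the idea only (this is not
Solr's code; the LazyDocument class, the loader callback and the sample
data are hypothetical), a stored field is read from storage the first time
it is asked for and not before:

```python
from typing import Callable, Dict, List


class LazyDocument:
    """Hypothetical document whose stored fields are read on first access."""

    def __init__(self, loader: Callable[[str], bytes]) -> None:
        self._loader = loader              # reads one field's bytes from storage
        self._cache: Dict[str, bytes] = {}
        self.loads: List[str] = []         # which fields were actually read

    def get(self, name: str) -> bytes:
        if name not in self._cache:
            self.loads.append(name)
            self._cache[name] = self._loader(name)
        return self._cache[name]


# Stand-in for on-disk stored fields (made-up data, for illustration only).
STORED = {"title": b"Holiday photos", "attachment": bytes(1_000_000)}

doc = LazyDocument(lambda name: STORED[name])
title = doc.get("title")    # only the small field is read
print(doc.loads)            # prints ['title']
```

The large attachment is only touched if something later calls
doc.get("attachment").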

 
> Searches are very common in this system, and it's very rare
> that someone actually opens up one of these attachments
> so I'm not really worried about the time it takes to fetch
> them when someone does actually want one.
 


You would be adding some overhead to the system, in that Solr now has to manage 
these files as stored fields.  I would do some benchmarking to see how much 
that costs.


--
Grant Ingersoll
http://www.lucidimagination.com



Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-30 Thread Paul Libbrecht
I am quite interested in this story, including sample code.
Back in the Lucene 1.4 and 2.0 days, the reader vs. string loading abilities were 
inconsistently handled, so I switched to one directory with thousands of 
files for our ActiveMath content storage. It works, but fairly badly on smaller 
machines (laptops among others).

If I were able to get lazy loading to work reliably, I think it'd be quite a 
win!

The overhead can be handled by separating out the index used for mass storage 
and storing only keys in the small index. That's what I do currently (with the 
mass storage in files).
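
A minimal sketch of that split, assuming in-memory dicts as stand-ins for
both stores (blob_store, search_index and the three functions are
hypothetical names, and using a content hash as the key is just one
possible choice):

```python
import hashlib
from typing import Dict, List

blob_store: Dict[str, bytes] = {}               # stand-in for the mass storage
search_index: Dict[str, Dict[str, str]] = {}    # small storage: text + key only


def add_document(doc_id: str, text: str, attachment: bytes) -> None:
    """Put the attachment in the mass storage and keep only its key in the index."""
    key = hashlib.sha256(attachment).hexdigest()
    blob_store[key] = attachment
    search_index[doc_id] = {"text": text, "blob_key": key}


def search(term: str) -> List[str]:
    """Naive substring search; it never touches the mass storage."""
    return sorted(d for d, f in search_index.items() if term in f["text"])


def fetch_attachment(doc_id: str) -> bytes:
    """Follow the key from the small index into the mass storage."""
    return blob_store[search_index[doc_id]["blob_key"]]


add_document("doc1", "holiday photos", b"fake image bytes")
print(search("holiday"))            # prints ['doc1']
print(fetch_attachment("doc1"))     # prints b'fake image bytes'
```

The cost is the one from the original question: the two stores have to be
kept in sync by the application.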

It'd be good if this were better than Hibernate or SQL storage, which in my 
experience has always been crappy at streaming field contents.

paul

>> More specifically - if I have a bunch of large stored fields,
>> would it significantly impact search performance in the
>> cases when those fields aren't fetched?
>
> Make sure you have lazy field loading on.
>
>> Searches are very common in this system, and it's very rare
>> that someone actually opens up one of these attachments
>> so I'm not really worried about the time it takes to fetch
>> them when someone does actually want one.
>
> You would be adding some overhead to the system in that Solr now has to 
> manage these files as stored fields.  I guess I would do some benchmarking to 
> see.



Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-30 Thread Lance Norskog
There is a binary field type for this problem. Trunk versions no longer
have to base64-encode the data; they just store the bytes directly (I
think).
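
For a sense of what base64 encoding costs in size, a small Python check
(the 3,000,000-byte payload is made up; this says nothing about how Solr
itself stores the field):

```python
import base64

raw = bytes(3_000_000)              # made-up attachment: 3,000,000 zero bytes
encoded = base64.b64encode(raw)     # every 3 input bytes become 4 output bytes

print(len(raw), len(encoded))       # prints 3000000 4000000
```

So base64 makes the stored data about a third larger than the raw bytes.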

There is a quirk: Solr/Lucene field contents (the stored part) are
stored in field order, so all of a document's fields sit in sequence
on the disk. When Lucene loads a document and returns the fields, it
walks this entire sequence on the disk.

If you say "get me everything except the big binary", it has to skip
over that long sequence on the disk. This then requires more disk I/O
to load any field, since it has to walk the whole sequence of data.
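
A toy model of that walk, as a Python sketch. The length-prefixed record
layout here is invented for illustration and is not Lucene's actual file
format; write_record and read_fields are hypothetical helpers:

```python
import io
import struct
from typing import Dict, List, Set, Tuple


def write_record(fields: List[Tuple[str, bytes]]) -> bytes:
    """Serialize fields in order as: name length, name, value length, value."""
    out = io.BytesIO()
    for name, value in fields:
        encoded_name = name.encode("utf-8")
        out.write(struct.pack(">H", len(encoded_name)))
        out.write(encoded_name)
        out.write(struct.pack(">I", len(value)))
        out.write(value)
    return out.getvalue()


def read_fields(record: bytes, wanted: Set[str]) -> Tuple[Dict[str, bytes], int]:
    """Walk the record in order; return the wanted fields and the bytes skipped."""
    buf = io.BytesIO(record)
    found: Dict[str, bytes] = {}
    skipped = 0
    while buf.tell() < len(record):
        (name_len,) = struct.unpack(">H", buf.read(2))
        name = buf.read(name_len).decode("utf-8")
        (value_len,) = struct.unpack(">I", buf.read(4))
        if name in wanted:
            found[name] = buf.read(value_len)
        else:
            buf.seek(value_len, io.SEEK_CUR)    # step over the unwanted value
            skipped += value_len
    return found, skipped


record = write_record([
    ("title", b"Holiday photos"),
    ("attachment", bytes(5_000_000)),
    ("author", b"ron"),
])
fields, skipped = read_fields(record, {"title", "author"})
print(sorted(fields), skipped)      # prints ['author', 'title'] 5000000
```

To reach "author" the reader still has to get past the 5,000,000 bytes of
"attachment" that sit in front of it.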

On Sat, Oct 30, 2010 at 4:16 AM, Paul Libbrecht hopla...@me.com wrote:
> I am quite interested by this story, including sample code.
> Back in Lucene 1.4 and 2.0 times, the reader vs string loading abilities was 
> inconsistently handled and I switched to have one directory with thousands of 
> files for our ActiveMath content storage. It works but fairly badly on 
> smaller machines (laptops among others).
>
> If I'd be able to get lazy loading to work faithfully I think it'd be quite a 
> win!
>
> The overhead can be handled by just separating the index for mass-storage 
> and only store keys in the small storage. That's what I do currently (with 
> mass-storage in files).
>
> It'd be good this is better than Hibernate or SQL storage which has always 
> been crappy wrt streaming field contents to my experience.
>
> paul






-- 
Lance Norskog
goks...@gmail.com


Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-29 Thread Ron Mayer
I have some documents with a bunch of attachments (images, thumbnails
for them, audio clips, word docs, etc); and am currently dealing with
them by just putting a path on a filesystem to them in solr; and then
jumping through hoops of keeping them in sync with solr.

Would it be nuts to stick the image data itself in solr?

More specifically - if I have a bunch of large stored fields,
would it significantly impact search performance in the
cases when those fields aren't fetched?

Searches are very common in this system, and it's very rare
that someone actually opens up one of these attachments
so I'm not really worried about the time it takes to fetch
them when someone does actually want one.



Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-29 Thread Shashi Kant
On Fri, Oct 29, 2010 at 6:00 PM, Ron Mayer r...@0ape.com wrote:

> I have some documents with a bunch of attachments (images, thumbnails
> for them, audio clips, word docs, etc); and am currently dealing with
> them by just putting a path on a filesystem to them in solr; and then
> jumping through hoops of keeping them in sync with solr.



Not sure why that is an issue. Keeping them in sync with solr would be the
same as storing within a file-system. Why would storing within solr be any
different?


> Would it be nuts to stick the image data itself in solr?
>
> More specifically - if I have a bunch of large stored fields,
> would it significantly impact search performance in the
> cases when those fields aren't fetched?


Hard to say. I assume you mean storing the data by converting it to base64. If
you do not retrieve the field when fetching, AFAIK it should not affect search
performance significantly, if at all.
So if you manage what you retrieve, you should be fine.
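
One way to manage retrieval is to list the wanted fields in Solr's fl
parameter so the attachment field is never requested. A sketch that only
builds the request URL; the host, port, path and the field names (id,
title, attachment) are assumptions about a particular setup, and
build_search_url is a hypothetical helper:

```python
from typing import List
from urllib.parse import urlencode

# Assumed location of the Solr select handler; adjust for your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_search_url(query: str, fields: List[str], rows: int = 10) -> str:
    """Build a select URL that asks only for the listed stored fields."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


# Everyday searches ask for small fields only.
search_url = build_search_url("text:vacation", ["id", "title", "score"])

# The rare "open the attachment" case asks for the big field explicitly.
open_url = build_search_url("id:doc1", ["attachment"], rows=1)

print(search_url)
print(open_url)
```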


 Searches are very common in this system, and it's very rare
 that someone actually opens up one of these attachments
 so I'm not really worried about the time it takes to fetch
 them when someone does actually want one.