bq: We also had an HDFS setup already so it looked like a good option
to not lose data. Earlier we had a few cases where we lost the
machines so HDFS looked safer for that.

Right, that's one of the places where using HDFS to back Solr makes a
lot of sense. The other approach is to just have replicas for each
shard distributed across different physical machines. But whatever
works is fine.
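
For the replica approach, a minimal sketch of building a Collections API
CREATE call with a replication factor. The host, collection name and the
numbers are placeholders, and the helper name is made up for illustration.
Replicas only end up on different physical machines if the Solr nodes
themselves run on different hosts.

```python
from urllib.parse import urlencode


def create_collection_url(solr_base, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL (hypothetical helper, nothing is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        # Number of copies of each shard; placement across hosts depends
        # on where the Solr nodes are actually running.
        "replicationFactor": replication_factor,
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Placeholder values: 2 shards, 3 replicas per shard.
url = create_collection_url("http://localhost:8983/solr", "example", 2, 3)
print(url)
```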

And there are a bunch of parameters you can tune both on HDFS and for
local file systems so "it's more an art than a science".

bq: Frequent adds with commits, which is likely not good in general
anyway, does look quite a bit slower than local storage so far.

I think you can go a long way towards fixing this by doing some
autowarming. I wouldn't want to open a new searcher every second and
do much autowarming over HDFS, but if you can stand less frequent
commits (say every minute?) you might be able to smooth out the
performance....
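
One way to get less frequent commits without changing the adds is to
drop the explicit commit and pass commitWithin on the update request
instead. A rough sketch, assuming JSON documents and a one minute
window. The host, collection name and helper name are placeholders, and
the request is only built here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(solr_base, collection, docs, commit_within_ms=60000):
    """Build an update request that lets Solr commit within the given window.

    Hypothetical helper: no explicit commit is issued, so many small adds
    can share one commit instead of opening a new searcher each time.
    """
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{solr_base}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder document and collection.
req = build_update_request(
    "http://localhost:8983/solr", "example", [{"id": "1", "title": "hello"}]
)
print(req.full_url)
```

Sending it would be a matter of urllib.request.urlopen(req) against a
running Solr. Whether 60 seconds is acceptable depends on how quickly
new documents need to become visible.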

Best,
Erick

On Wed, Nov 22, 2017 at 11:31 AM, Hendrik Haddorp
<hendrik.hadd...@gmx.net> wrote:
> We actually use no auto warming. Our collections are pretty small and the
> query performance is not really a problem so far. We are using lots of
> collections and most Solr caches seem to be per core and not global so we
> also have a problem with caching. I have to test the HDFS cache some more as
> that should work across collections.
>
> We also had an HDFS setup already so it looked like a good option to not
> lose data. Earlier we had a few cases where we lost the machines so HDFS
> looked safer for that.
>
> I would expect that the HDFS performance is also quite good if you have lots
> of document adds and not so frequent commits. Frequent adds with commits,
> which is likely not good in general anyway, does look quite a bit slower
> than local storage so far. As we didn't see that in our earlier tests, which
> were more query-focused, I said it largely depends on what you are doing.
>
> Hendrik
>
> On 22.11.2017 18:41, Erick Erickson wrote:
>>
>> In my experience, for relatively static indexes the performance is
>> roughly similar. Once the data is read from whatever data source, it's
>> in memory, and where the data came from is (largely) secondary in
>> importance.
>>
>> In cases where there's a lot of I/O I expect HDFS to be slower. This
>> fits Hendrik's observation: "We now had a pattern with lots of small
>> updates and commits and that seems to be quite a bit slower". He's
>> merging segments and (presumably) autowarming frequently, implying
>> lots of I/O, and HDFS adds an extra layer.
>>
>> Personally I'd use whichever is most convenient and see if the
>> performance is "good enough". I wouldn't recommend _installing_ HDFS
>> just to use it with Solr. Why add another complication? If you need
>> the redundancy, add replicas. If you already have the HDFS
>> infrastructure in place and using HDFS is easier than local storage,
>> feel free....
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
>> <greenhorntec...@gmail.com> wrote:
>>>
>>> Hendrik,
>>>
>>> Thanks for your response.
>>>
>>> Regarding "But this seems to greatly depend on how your setup looks
>>> and what actions you perform." May I know which factors influence this
>>> and what considerations should be taken into account?
>>>
>>> Thanks
>>>
>>> On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp <hendrik.hadd...@gmx.net>
>>> wrote:
>>>
>>>> We did some testing and the performance was strangely even better with
>>>> HDFS than with the local file system. But this seems to greatly
>>>> depend on how your setup looks and what actions you perform. We now
>>>> had a pattern with lots of small updates and commits and that seems to be
>>>> quite a bit slower. We are about to do performance testing on that now.
>>>>
>>>> The reason we switched to HDFS was largely connected to us using Docker
>>>> and Marathon/Mesos. With HDFS the data is in a shared file system and
>>>> thus it is possible to move the replica to a different instance on a
>>>> different host.
>>>>
>>>> regards,
>>>> Hendrik
>>>>
>>>> On 22.11.2017 14:59, Greenhorn Techie wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Good Afternoon!!
>>>>>
>>>>> While the discussion around issues related to "Solr on HDFS" is live, I
>>>>> would like to understand if anyone has done any performance
>>>>> benchmarking of both Solr indexing and search, comparing HDFS with the
>>>>> local file system.
>>>>>
>>>>> Also, from experience, what would the community folks suggest? Solr on
>>>>> local file system or Solr on HDFS? Has anyone done a comparative study
>>>>> of these choices?
>>>>>
>>>>> Thanks
>>>>>
>>>>
>
