Re: Result grouping performance

2017-11-22 Thread Mikhail Khludnev
Akos,
Can you provide your request params? Do you just group and/or count grouped
facets?
Can you clarify how field collapsing is different from grouping, just to make
it unambiguous?


On Wed, Nov 22, 2017 at 4:13 PM, Kempelen, Ákos <
akos.kempe...@wolterskluwer.com> wrote:

> Hello,
>
> I am migrating our codebase from Solr 4.7 to 7.0.1 but the performance of
> result grouping seems very poor using the newer Solr.
> For example a simple MatchAllDocsQuery takes 5 sec on Solr4.7, and 21 sec
> on Solr7.
> I wonder what causes the 4x difference in time? We hoped that newer Solr
> versions would provide better performance...
> Using field collapsing could be a solution, but it produces
> different facet counts.
> Thanks,
> Akos
>
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Merging of index in Solr

2017-11-22 Thread Shawn Heisey

On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote:

I'm doing the merging on the SSD drive, the speed should be ok?


The speed of virtually all modern disks will have almost no influence on 
the speed of the merge.  The bottleneck isn't disk transfer speed, it's 
the operation of the merge code in Lucene.


As I said earlier in this thread, a merge is **NOT** just a copy. Lucene 
must completely rebuild the data structures of the index to incorporate 
all of the segments of the source indexes into a single segment in the 
target index, while simultaneously *excluding* information from 
documents that have been deleted.


The best speed I have ever personally seen for a merge is 30 megabytes 
per second.  This is far below the sustained transfer rate of a typical 
modern SATA disk.  SSD is capable of far faster data transfer ...but it 
will NOT make merges go any faster.
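
To make the nature of the work concrete, here is a minimal Lucene sketch of
roughly what such a merge boils down to (the paths are made up; IndexMergeTool
does essentially this internally):

import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths: the target index plus the two source cores.
    try (Directory target = FSDirectory.open(Paths.get("/indexes/merged"));
         Directory src1 = FSDirectory.open(Paths.get("/indexes/core1/data/index"));
         Directory src2 = FSDirectory.open(Paths.get("/indexes/core2/data/index"));
         // No analyzer is needed here: addIndexes copies already-inverted segments.
         IndexWriter writer = new IndexWriter(target, new IndexWriterConfig(null))) {
      // Every posting list, stored field, doc-values and norms structure of the
      // sources is rewritten into the target; deleted documents are dropped.
      writer.addIndexes(src1, src2);
      // Collapse everything into a single segment, like an optimize.
      writer.forceMerge(1);
      writer.commit();
    }
  }
}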



We need to merge because the data are indexed in two different collections,
and we need them to be under the same collection, so that we can do things
like faceting more accurately.
Will sharding alone achieve this? Or do we have to merge first before we do
the sharding?


If you want the final index to be sharded, it's typically best to index 
from scratch into a new empty collection that has the number of shards 
you want.  The merging tool you're using isn't aware of concepts like 
shards.  It combines everything into a single index.
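
As a rough illustration of that approach, a SolrJ (6.x-style) sketch for
creating the new sharded collection to reindex into might look like the
following; the collection name, config set, shard and replica counts are
made-up examples:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateShardedCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr")  // hypothetical ZK ensemble
            .build()) {
      // Create a new, empty collection with the desired shard layout,
      // then reindex from the source system instead of merging old cores.
      CollectionAdminRequest
          .createCollection("emails_v2", "emails_conf", 8, 2) // name, configset, shards, replicas
          .process(client);
    }
  }
}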


It's not entirely clear what you're asking with the question about 
sharding alone.  Making a guess:  I have never heard of facet accuracy 
being affected by whether or not the index is sharded.  If that *is* 
possible, then I would expect an index that is NOT sharded to have 
better accuracy.


Thanks,
Shawn



Re: NullPointerException in PeerSync.handleUpdates

2017-11-22 Thread Michael Braun
I went ahead and resolved the jira - it was never seen again by us in later
versions of Solr. There are a number of bug fixes since the 6.2 release, so
I personally recommend updating!

On Wed, Nov 22, 2017 at 11:48 AM, Pushkar Raste 
wrote:

> As mentioned in the JIRA, the exception seems to be coming from a log
> statement. The issue was fixed in 6.3; here is the relevant line from 6.3:
> https://github.com/apache/lucene-solr/blob/releases/
> lucene-solr/6.3.0/solr/core/src/java/org/apache/solr/
> update/PeerSync.java#L707
>
>
>
> On Wed, Nov 22, 2017 at 1:18 AM, Erick Erickson 
> wrote:
>
> > Right, if there's no "fixed version" mentioned and if the resolution
> > is "unresolved", it's not in the code base at all. But that JIRA is
> > not apparently reproducible, especially on more recent versions than
> > 6.2. Is it possible to test a more recent version (6.6.2 would be my
> > recommendation)?
> >
> > Erick
> >
> > On Tue, Nov 21, 2017 at 9:58 PM, S G  wrote:
> > > My bad. I found it at https://issues.apache.org/jira/browse/SOLR-9453
> > > But I could not find it in changes.txt, perhaps because it's not yet
> > resolved.
> > >
> > > On Tue, Nov 21, 2017 at 9:15 AM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > >> Did you check the JIRA list? Or CHANGES.txt in more recent versions?
> > >>
> > >> On Tue, Nov 21, 2017 at 1:13 AM, S G 
> wrote:
> > >> > Hi,
> > >> >
> > >> > We are running 6.2 version of Solr and hitting this error
> frequently.
> > >> >
> > >> > Error while trying to recover. core=my_core:java.lang.
> > >> NullPointerException
> > >> > at org.apache.solr.update.PeerSync.handleUpdates(
> > >> PeerSync.java:605)
> > >> > at org.apache.solr.update.PeerSync.handleResponse(
> > >> PeerSync.java:344)
> > >> > at org.apache.solr.update.PeerSync.sync(PeerSync.java:257)
> > >> > at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
> > >> RecoveryStrategy.java:376)
> > >> > at org.apache.solr.cloud.RecoveryStrategy.run(
> > >> RecoveryStrategy.java:221)
> > >> > at java.util.concurrent.Executors$RunnableAdapter.
> > >> call(Executors.java:511)
> > >> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > >> > at org.apache.solr.common.util.ExecutorUtil$
> > >> MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> > >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > >> ThreadPoolExecutor.java:1142)
> > >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > >> ThreadPoolExecutor.java:617)
> > >> > at java.lang.Thread.run(Thread.java:745)
> > >> >
> > >> >
> > >> >
> > >> > Is this a known issue and fixed in some newer version?
> > >> >
> > >> >
> > >> > Thanks
> > >> > SG
> > >>
> >
>


Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
Hi Erick,

Yes, we are planning to do sharding when we upgrade to the newer Solr
7.1.0, and probably will re-index everything. But currently we are waiting
for certain issues with indexing EML files into Solr 7.1.0 to be addressed
first, such as this JIRA, https://issues.apache.org/jira/browse/SOLR-11622,
which currently gives the following error when indexing EML files.

java.lang.NoClassDefFoundError:
org/apache/james/mime4j/stream/MimeConfig$Builder


Meanwhile, as we are still on Solr 6.5.1, we plan to just merge the index,
so that customers can continue to access the current index. The re-indexing
will likely take 3 to 4 weeks, given the size of the data. Also, is
there any way to do sharding for our current index size of 3.5TB, or is
re-indexing the only way?

Regards,
Edwin


On 23 November 2017 at 09:31, Erick Erickson 
wrote:

> Sure, sharding can give you accurate faceting, although do note there
> are nuances, JSON faceting can occasionally be not exact, although
> there are JIRAs being worked on to correct this.
>
> "traditional" faceting has a refinement phase that gets accurate counts.
>
> But the net-net is that I believe your merging is just the first of
> many problems you'll encounter with indexes this size and starting
> over with a reasonable sharding strategy is likely the fastest path to
> what you want.
>
> Merging indexes isn't going to work for you though, you'll have to
> create a new collection and reindex everything. As a straw-man
> recommendation, I'd put no more than 200G on each shard in terms of
> index size.
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
>  wrote:
> > I'm doing the merging on the SSD drive, the speed should be ok?
> >
> > We need to merge because the data are indexed in two different
> collections,
> > and we need them to be under the same collection, so that we can do
> things
> > like faceting more accurately.
> > Will sharding alone achieve this? Or do we have to merge first before we
> do
> > the sharding?
> >
> > Regards,
> > Edwin
> >
> > On 23 November 2017 at 01:32, Erick Erickson 
> > wrote:
> >
> >> Really, let's back up here though. This sure seems like an XY problem.
> >> You're merging indexes that will eventually be something on the order
> >> of 3.5TB. I claim that an index of that size is very difficult to work
> >> with effectively. _Why_ do you want to do this? Do you have any
> >> evidence that you'll be able to effectively use it?
> >>
> >> And Shawn tells you that the result will be one large segment. If you
> >> replace documents in that index, it will consist of around 3.4975T
> >> wasted space before the segment is merged, see:
> >> https://lucidworks.com/2017/10/13/segment-merging-deleted-
> >> documents-optimize-may-bad/.
> >>
> >> You already know that merging is extremely painful. This sure seems
> >> like a case where the evidence is mounting that you would be far
> >> better off sharding and _not_ merging.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey 
> wrote:
> >> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> >> >> I am using the IndexMergeTool from Solr, from the command below:
> >> >>
> >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> >> >> org.apache.lucene.misc.IndexMergeTool
> >> >>
> >> >> The heap size is 32GB. There are more than 20 million documents in
> the
> >> two
> >> >> cores.
> >> >
> >> > I have looked at IndexMergeTool, and confirmed that it does its job in
> >> > exactly the same way that Solr does an optimize, so I would still
> expect
> >> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
> >> > hardware that can't transfer data that quickly.
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >>
>


Re: DelimitedPayloadTokenFilterFactory missing from ref guide

2017-11-22 Thread John Anonymous
I would love to add to the documentation, but I don't know enough about
this filter to do so.  I was under the impression that this filter would
store my payloads and then strip the payload characters from the indexed
text.  I am getting good results from the payload_check query parser, but my
indexed text currently looks something like this: "The|0 big|1 tree|1",
where the number after '|' is the payload for each token.  I have no idea if
this is the intended functionality or not.  I want to be able to use payloads,
but I want the text returned to look like "The big tree" without the payload
markers.  Any ideas?
Thanks!
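
For what it's worth, a small Lucene sketch (the input text and delimiter are
chosen purely for illustration) showing that DelimitedPayloadTokenFilter strips
the delimiter and payload from the token text it produces, while attaching the
payload as a separate per-token attribute. What comes back in search results is
the stored value, which keeps whatever text was sent, so that is a separate
concern from the analysis chain:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IntegerEncoder;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public class PayloadDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("The|0 big|1 tree|1"));
    TokenStream ts = new DelimitedPayloadTokenFilter(tokenizer, '|', new IntegerEncoder());
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PayloadAttribute payload = ts.addAttribute(PayloadAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Prints "The", "big", "tree" -- the delimiter and number are not part
      // of the token text; the payload travels as a per-token attribute.
      System.out.println(term.toString() + " payload=" + payload.getPayload());
    }
    ts.end();
    ts.close();
  }
}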

On Wed, Nov 22, 2017 at 7:06 PM, Erick Erickson 
wrote:

> Thanks for noticing. Note that anyone can edit the asciidoc pages and
> submit a patch, it'd be great if you could submit a patch and add it
> to a JIRA.
>
> See the Write/Improve User Documentation section here:
> https://wiki.apache.org/solr/HowToContribute
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 3:37 PM, John Anonymous  wrote:
> > DelimitedPayloadTokenFilterFactory appears to be missing from this page:
> > https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html
>


Re: Merging of index in Solr

2017-11-22 Thread Erick Erickson
Sure, sharding can give you accurate faceting, although do note there
are nuances, JSON faceting can occasionally be not exact, although
there are JIRAs being worked on to correct this.

"traditional" faceting has a refinement phase that gets accurate counts.

But the net-net is that I believe your merging is just the first of
many problems you'll encounter with indexes this size and starting
over with a reasonable sharding strategy is likely the fastest path to
what you want.

Merging indexes isn't going to work for you though, you'll have to
create a new collection and reindex everything. As a straw-man
recommendation, I'd put no more than 200G on each shard in terms of
index size.

Best,
Erick

On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
 wrote:
> I'm doing the merging on the SSD drive, the speed should be ok?
>
> We need to merge because the data are indexed in two different collections,
> and we need them to be under the same collection, so that we can do things
> like faceting more accurately.
> Will sharding alone achieve this? Or do we have to merge first before we do
> the sharding?
>
> Regards,
> Edwin
>
> On 23 November 2017 at 01:32, Erick Erickson 
> wrote:
>
>> Really, let's back up here though. This sure seems like an XY problem.
>> You're merging indexes that will eventually be something on the order
>> of 3.5TB. I claim that an index of that size is very difficult to work
>> with effectively. _Why_ do you want to do this? Do you have any
>> evidence that you'll be able to effectively use it?
>>
>> And Shawn tells you that the result will be one large segment. If you
>> replace documents in that index, it will consist of around 3.4975T
>> wasted space before the segment is merged, see:
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-
>> documents-optimize-may-bad/.
>>
>> You already know that merging is extremely painful. This sure seems
>> like a case where the evidence is mounting that you would be far
>> better off sharding and _not_ merging.
>>
>> FWIW,
>> Erick
>>
>> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey  wrote:
>> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
>> >> I am using the IndexMergeTool from Solr, from the command below:
>> >>
>> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
>> >> org.apache.lucene.misc.IndexMergeTool
>> >>
>> >> The heap size is 32GB. There are more than 20 million documents in the
>> two
>> >> cores.
>> >
>> > I have looked at IndexMergeTool, and confirmed that it does its job in
>> > exactly the same way that Solr does an optimize, so I would still expect
>> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
>> > hardware that can't transfer data that quickly.
>> >
>> > Thanks,
>> > Shawn
>> >
>>


Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
I'm doing the merging on the SSD drive, the speed should be ok?

We need to merge because the data are indexed in two different collections,
and we need them to be under the same collection, so that we can do things
like faceting more accurately.
Will sharding alone achieve this? Or do we have to merge first before we do
the sharding?

Regards,
Edwin

On 23 November 2017 at 01:32, Erick Erickson 
wrote:

> Really, let's back up here though. This sure seems like an XY problem.
> You're merging indexes that will eventually be something on the order
> of 3.5TB. I claim that an index of that size is very difficult to work
> with effectively. _Why_ do you want to do this? Do you have any
> evidence that you'll be able to effectively use it?
>
> And Shawn tells you that the result will be one large segment. If you
> replace documents in that index, it will consist of around 3.4975T
> wasted space before the segment is merged, see:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-
> documents-optimize-may-bad/.
>
> You already know that merging is extremely painful. This sure seems
> like a case where the evidence is mounting that you would be far
> better off sharding and _not_ merging.
>
> FWIW,
> Erick
>
> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey  wrote:
> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> >> I am using the IndexMergeTool from Solr, from the command below:
> >>
> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> >> org.apache.lucene.misc.IndexMergeTool
> >>
> >> The heap size is 32GB. There are more than 20 million documents in the
> two
> >> cores.
> >
> > I have looked at IndexMergeTool, and confirmed that it does its job in
> > exactly the same way that Solr does an optimize, so I would still expect
> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
> > hardware that can't transfer data that quickly.
> >
> > Thanks,
> > Shawn
> >
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Erick Erickson
Hmm. This is quite possible. Any time things take "too long" it can be
 a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:
> Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
> by:
> hadoop fs -du -s -h /solr6.6.0
> 29.9 T  89.9 T  /solr6.6.0
>
> The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
> billion documents indexed and we index about 2.5 million documents per day.
> Assuming an even distribution, each node is handling about 680GBytes of
> index.  So our cache size is 1.4%. Perhaps 'relatively small block cache'
> was an understatement! This is why we split the largest collection into two,
> where one is data going back 30 days, and the other is all the data.  Most
> of our searches are not longer than 30 days back.  The 30 day index is
> 2.6TBytes total.  I don't know how the HDFS block cache splits between
> collections, but the 30 day index performs acceptable for our specific
> application.
>
> If we wanted to cache 50% of the index, each of our 45 nodes would need a
> block cache of about 350GBytes.  I'm accepting offers of DIMMs!
>
> What I believe caused our 'recovery, fail, retry loop' was one of our
> servers died.  This caused HDFS to start to replicate blocks across the
> cluster and produced a lot of network activity.  When this happened, I
> believe there was high network contention for specific nodes in the cluster
> and their network interfaces became pegged and requests for HDFS blocks
> timed out.  When that happened, SolrCloud went into recovery which caused
> more network traffic.  Fun stuff.
>
> -Joe
>
>
> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>>
>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>>
>>> Right now, we have a relatively small block cache due to the
>>> requirements that the servers run other software.  We tried to find
>>> the best balance between block cache size, and RAM for programs, while
>>> still giving enough for local FS cache.  This came out to be 84 128M
>>> blocks - or about 10G for the cache per node (45 nodes total).
>>
>> How much data is being handled on a server with 10GB allocated for
>> caching HDFS data?
>>
>> The first message in this thread says the index size is 31TB, which is
>> *enormous*.  You have also said that the index takes 93TB of disk
>> space.  If the data is distributed somewhat evenly, then the answer to
>> my question would be that each of those 45 Solr servers would be
>> handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.
>>
>> When index data that Solr needs to access for an operation is not in the
>> cache and Solr must actually wait for disk and/or network I/O, the
>> resulting performance usually isn't very good.  In most cases you don't
>> need to have enough memory to fully cache the index data ... but less
>> than half a percent is not going to be enough.
>>
>> Thanks,
>> Shawn
>>
>>
>> ---
>> This email has been checked for viruses by AVG.
>> http://www.avg.com
>>
>


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
bq: We also had an HDFS setup already so it looked like a good option
to not lose data. Earlier we had a few cases where we lost the
machines so HDFS looked safer for that.

right, that's one of the places where using HDFS to back Solr makes a
lot of sense. The other approach is to just have replicas for each
shard distributed across different physical machines. But whatever
works is fine.

And there are a bunch of parameters you can tune both on HDFS and for
local file systems so "it's more an art than a science".

bq: Frequent adds with commits, which is likely not good in general
anyway, does look quite a bit slower than local storage so far.

I think you can go a long way towards fixing this by doing some
autowarming. I wouldn't want to open a new searcher every second and
do much autowarming over HDFS, but if you can stand less frequent
commits (say every minute?) you might be able to smooth out the
performance.

Best,
Erick

On Wed, Nov 22, 2017 at 11:31 AM, Hendrik Haddorp
 wrote:
> We actually use no auto warming. Our collections are pretty small and the
> query performance is not really a problem so far. We are using lots of
> collections and most Solr caches seem to be per core and not global so we
> also have a problem with caching. I have to test the HDFS cache some more as
> that should work cross collections.
>
> We also had an HDFS setup already so it looked like a good option to not
> loos data. Earlier we had a few cases where we lost the machines so HDFS
> looked safer for that.
>
> I would expect that the HDFS performance is also quite good if you have lots
> of document adds and not so frequent commits. Frequent adds with commits,
> which is likely not good in general anyway, does look quite a bit slower
> then local storage so far. As we didn't see that in our earlier tests, which
> were more, query focused, I said it large depends on what you are doing.
>
> Hendrik
>
> On 22.11.2017 18:41, Erick Erickson wrote:
>>
>> In my experience, for relatively static indexes the performance is
>> roughly similar. Once the data is read from whatever data source it's
>> in memory, where the data came from is (largely) secondary in
>> importance.
>>
>> In cases where there's a lot of I/O I expect HDFS to be slower, this
>> fits Hendrik's observation: "We now had a patter with lots of small
>> updates and commits and that seems to be quite a bit slower". He's
>> merging segments and (presumably) autowarming frequently, implying
>> lots of I/O and HDFS adds an extra layer.
>>
>> Personally I'd use whichever is most convenient and see if the
>> performance was "good enough". I wouldn't recommend _installing_ HDFS
>> just to use it with Solr, why add another complication? If you need
>> the redundancy add replicas. If you already have the HDFS
>> infrastructure in place and using HDFS is easier than local storage,
>> feel free
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
>>  wrote:
>>>
>>> Hendrik,
>>>
>>> Thanks for your response.
>>>
>>> Regarding "But this seems to greatly depend on how your setup looks like
>>> and what actions you perform." May I know what are the factors influence
>>> and what considerations are to be taken in relation to this?
>>>
>>> Thanks
>>>
>>> On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
>>> wrote:
>>>
 We did some testing and the performance was strangely even better with
 HDFS than with the local file system. But this seems to greatly
 depend on how your setup looks like and what actions you perform. We now
 had a pattern with lots of small updates and commits and that seems to be
 quite a bit slower. We are about to do performance testing on that now.

 The reason we switched to HDFS was largely connected to us using Docker
 and Marathon/Mesos. With HDFS the data is in a shared file system and
 thus it is possible to move the replica to a different instance on a
 different host.

 regards,
 Hendrik

 On 22.11.2017 14:59, Greenhorn Techie wrote:
>
> Hi,
>
> Good Afternoon!!
>
> While the discussion around issues related to "Solr on HDFS" is live, I
> would like to understand if anyone has done any performance
> benchmarking
> for both Solr indexing and search between HDFS vs local file system.
>
> Also, from experience, what would the community folks suggest? Solr on
> local file system or Solr on HDFS? Has anyone done a comparative study
> of
> these choices?
>
> Thanks
>

>


Re: Embedded SOLR - Best practice?

2017-11-22 Thread Erick Erickson
I don't really understand what you're saying here. Solr is pretty
fast, why not just put all 400K docs on a Solr instance and just use
that?

EmbeddedSolrServer works, but it really seems like for a small corpus
like this just using a separate stand-alone Solr is way easier.

Best,
Erick
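
For the stand-alone route, a minimal SolrJ sketch issuing the kind of range
query mentioned (the URL, core and field names here are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RangeQueryDemo {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/cache_core").build()) {
      // Hypothetical range query over a date field.
      SolrQuery q = new SolrQuery("timestamp:[NOW-1DAY TO NOW]");
      q.setRows(50);
      QueryResponse rsp = client.query(q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}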

On Wed, Nov 22, 2017 at 11:39 AM, hvengurlekar  wrote:
> Hello Folks,
> Currently, I am using SOLR in production and the around 40 documents are
> stored. For a particular use-case, I am looking to cache around 10
> documents daily. I am thinking of using embedded solr as the cache to
> support the range queries. I did not find any good documentation regarding
> why embedded solr is not a best practice?
> Could you guys please help me out? Thanks!
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: DelimitedPayloadTokenFilterFactory missing from ref guide

2017-11-22 Thread Erick Erickson
Thanks for noticing. Note that anyone can edit the asciidoc pages and
submit a patch, it'd be great if you could submit a patch and add it
to a JIRA.

See the Write/Improve User Documentation section here:
https://wiki.apache.org/solr/HowToContribute

Best,
Erick

On Wed, Nov 22, 2017 at 3:37 PM, John Anonymous  wrote:
> DelimitedPayloadTokenFilterFactory appears to be missing from this page:
> https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html


DelimitedPayloadTokenFilterFactory missing from ref guide

2017-11-22 Thread John Anonymous
DelimitedPayloadTokenFilterFactory appears to be missing from this page:
https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html


Embedded SOLR - Best practice?

2017-11-22 Thread hvengurlekar
Hello Folks,
Currently, I am using Solr in production and around 40 documents are
stored. For a particular use case, I am looking to cache around 10
documents daily. I am thinking of using embedded Solr as the cache to
support the range queries. I did not find any good documentation regarding
why embedded Solr is not a best practice.
Could you guys please help me out? Thanks!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We actually use no auto warming. Our collections are pretty small and 
the query performance is not really a problem so far. We are using lots 
of collections and most Solr caches seem to be per core and not global 
so we also have a problem with caching. I have to test the HDFS cache 
some more as that should work across collections.


We also had an HDFS setup already so it looked like a good option to not 
lose data. Earlier we had a few cases where we lost the machines so HDFS 
looked safer for that.


I would expect that the HDFS performance is also quite good if you have 
lots of document adds and not so frequent commits. Frequent adds with 
commits, which is likely not good in general anyway, does look quite a 
bit slower than local storage so far. As we didn't see that in our 
earlier tests, which were more query focused, I said it largely depends 
on what you are doing.


Hendrik

On 22.11.2017 18:41, Erick Erickson wrote:

In my experience, for relatively static indexes the performance is
roughly similar. Once the data is read from whatever data source it's
in memory, where the data came from is (largely) secondary in
importance.

In cases where there's a lot of I/O I expect HDFS to be slower, this
fits Hendrik's observation: "We now had a patter with lots of small
updates and commits and that seems to be quite a bit slower". He's
merging segments and (presumably) autowarming frequently, implying
lots of I/O and HDFS adds an extra layer.

Personally I'd use whichever is most convenient and see if the
performance was "good enough". I wouldn't recommend _installing_ HDFS
just to use it with Solr, why add another complication? If you need
the redundancy add replicas. If you already have the HDFS
infrastructure in place and using HDFS is easier than local storage,
feel free

Best,
Erick


On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
 wrote:

Hendrik,

Thanks for your response.

Regarding "But this seems to greatly depend on how your setup looks like
and what actions you perform." May I know what are the factors influence
and what considerations are to be taken in relation to this?

Thanks

On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
wrote:


We did some testing and the performance was strangely even better with
HDFS than with the local file system. But this seems to greatly
depend on how your setup looks like and what actions you perform. We now
had a pattern with lots of small updates and commits and that seems to be
quite a bit slower. We are about to do performance testing on that now.

The reason we switched to HDFS was largely connected to us using Docker
and Marathon/Mesos. With HDFS the data is in a shared file system and
thus it is possible to move the replica to a different instance on a
different host.

regards,
Hendrik

On 22.11.2017 14:59, Greenhorn Techie wrote:

Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks







Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Shawn - thank you for your reply.  The index is 29.9TBytes as 
reported by:

hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about 
1.1 billion documents indexed and we index about 2.5 million documents 
per day.  Assuming an even distribution, each node is handling about 
680GBytes of index.  So our cache size is 1.4%. Perhaps 'relatively 
small block cache' was an understatement! This is why we split the 
largest collection into two, where one is data going back 30 days, and 
the other is all the data.  Most of our searches are not longer than 30 
days back.  The 30 day index is 2.6TBytes total.  I don't know how the 
HDFS block cache splits between collections, but the 30 day index 
performs acceptably for our specific application.


If we wanted to cache 50% of the index, each of our 45 nodes would need 
a block cache of about 350GBytes.  I'm accepting offers of DIMMs!


What I believe caused our 'recovery, fail, retry loop' was that one of our 
servers died.  This caused HDFS to start to replicate blocks across the 
cluster and produced a lot of network activity.  When this happened, I 
believe there was high network contention for specific nodes in the 
cluster and their network interfaces became pegged and requests for HDFS 
blocks timed out.  When that happened, SolrCloud went into recovery 
which caused more network traffic.  Fun stuff.


-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, while
still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn


---
This email has been checked for viruses by AVG.
http://www.avg.com





Re: highlight separator

2017-11-22 Thread David Hastings
Thanks, I kind of figured that was the case.

On Wed, Nov 22, 2017 at 12:24 PM, Erick Erickson 
wrote:

> I think that's only for the Unified Highlighter, which was introduced
> to Lucene in 6.3 and Solr in 6.4. See: SOLR-9708
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 9:01 AM, David Hastings
>  wrote:
> > Im on solr 5.x at the moment, and am trying to get the highlighter to
> > display complete sentences containing the match.  setting:
> >
> > 'hl.method' => 'fastVector',
> > 'hl.bs.type' =>'SENTENCE',
> >
> > hasnt been proving to work.  is there a way for me to do it in the query
> > itself?
> > thanks
> > -Dave
>


Re: Result grouping performance

2017-11-22 Thread Erick Erickson
Have you enabled docValues (and reindexed from scratch) on the field
you're grouping on?

Best,
Erick
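
For reference, a typical grouping request of the kind being discussed, as a
SolrJ sketch; the field names are placeholders, and grouping on a field
generally benefits from that field having docValues enabled (which requires
reindexing):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupingQueryDemo {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");       // the MatchAllDocsQuery case
      q.set("group", true);
      q.set("group.field", "product_id");       // hypothetical grouping field
      q.set("group.ngroups", true);             // counting distinct groups adds cost
      q.setFacet(true);
      q.addFacetField("category");              // hypothetical facet field
      q.set("group.facet", true);               // grouped facet counts are notably expensive
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getGroupResponse().getValues());
    }
  }
}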

On Wed, Nov 22, 2017 at 5:13 AM, Kempelen, Ákos
 wrote:
> Hello,
>
> I am migrating our codebase from Solr 4.7 to 7.0.1 but the performance of 
> result grouping seems very poor using the newer Solr.
> For example a simple MatchAllDocsQuery takes 5 sec on Solr4.7, and 21 sec on 
> Solr7.
> I wonder what causes the 4x difference in time? We hoped that newer Solr 
> versions would provide better performance...
> Using field collapsing could be a solution, but it produces different 
> facet counts.
> Thanks,
> Akos
>
>


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
In my experience, for relatively static indexes the performance is
roughly similar. Once the data is read from whatever data source it's
in memory, where the data came from is (largely) secondary in
importance.

In cases where there's a lot of I/O I expect HDFS to be slower, this
fits Hendrik's observation: "We now had a pattern with lots of small
updates and commits and that seems to be quite a bit slower". He's
merging segments and (presumably) autowarming frequently, implying
lots of I/O and HDFS adds an extra layer.

Personally I'd use whichever is most convenient and see if the
performance was "good enough". I wouldn't recommend _installing_ HDFS
just to use it with Solr, why add another complication? If you need
the redundancy add replicas. If you already have the HDFS
infrastructure in place and using HDFS is easier than local storage,
feel free

Best,
Erick


On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
 wrote:
> Hendrik,
>
> Thanks for your response.
>
> Regarding "But this seems to greatly depend on how your setup looks like
> and what actions you perform." May I know what are the factors influence
> and what considerations are to be taken in relation to this?
>
> Thanks
>
> On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
> wrote:
>
>> We did some testing and the performance was strangely even better with
>> HDFS than with the local file system. But this seems to greatly
>> depend on how your setup looks like and what actions you perform. We now
>> had a pattern with lots of small updates and commits and that seems to be
>> quite a bit slower. We are about to do performance testing on that now.
>>
>> The reason we switched to HDFS was largely connected to us using Docker
>> and Marathon/Mesos. With HDFS the data is in a shared file system and
>> thus it is possible to move the replica to a different instance on a
>> different host.
>>
>> regards,
>> Hendrik
>>
>> On 22.11.2017 14:59, Greenhorn Techie wrote:
>> > Hi,
>> >
>> > Good Afternoon!!
>> >
>> > While the discussion around issues related to "Solr on HDFS" is live, I
>> > would like to understand if anyone has done any performance benchmarking
>> > for both Solr indexing and search between HDFS vs local file system.
>> >
>> > Also, from experience, what would the community folks suggest? Solr on
>> > local file system or Solr on HDFS? Has anyone done a comparative study of
>> > these choices?
>> >
>> > Thanks
>> >
>>
>>


Re: Merging of index in Solr

2017-11-22 Thread Erick Erickson
Really, let's back up here though. This sure seems like an XY problem.
You're merging indexes that will eventually be something on the order
of 3.5TB. I claim that an index of that size is very difficult to work
with effectively. _Why_ do you want to do this? Do you have any
evidence that you'll be able to effectively use it?

And Shawn tells you that the result will be one large segment. If you
replace documents in that index, it will consist of around 3.4975T
wasted space before the segment is merged, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.

You already know that merging is extremely painful. This sure seems
like a case where the evidence is mounting that you would be far
better off sharding and _not_ merging.

FWIW,
Erick

On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey  wrote:
> On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
>> I am using the IndexMergeTool from Solr, from the command below:
>>
>> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
>> org.apache.lucene.misc.IndexMergeTool
>>
>> The heap size is 32GB. There are more than 20 million documents in the two
>> cores.
>
> I have looked at IndexMergeTool, and confirmed that it does its job in
> exactly the same way that Solr does an optimize, so I would still expect
> a rate of 20 to 30 MB per second, unless it's running on REALLY old
> hardware that can't transfer data that quickly.
>
> Thanks,
> Shawn
>


Re: highlight separator

2017-11-22 Thread Erick Erickson
I think that's only for the Unified Highlighter, which was introduced
to Lucene in 6.3 and Solr in 6.4. See: SOLR-9708

Best,
Erick

On Wed, Nov 22, 2017 at 9:01 AM, David Hastings
 wrote:
> Im on solr 5.x at the moment, and am trying to get the highlighter to
> display complete sentences containing the match.  setting:
>
> 'hl.method' => 'fastVector',
> 'hl.bs.type' =>'SENTENCE',
>
> hasnt been proving to work.  is there a way for me to do it in the query
> itself?
> thanks
> -Dave


Re: Do i need to reindex after changing similarity setting

2017-11-22 Thread Nawab Zada Asad Iqbal
Thanks Walter

On Mon, Nov 20, 2017 at 4:59 PM Walter Underwood 
wrote:

> Similarity is query time.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Nov 20, 2017, at 4:57 PM, Nawab Zada Asad Iqbal 
> wrote:
> >
> > Hi,
> >
> > I want to switch to Classic similarity instead of BM25 (default in
> solr7).
> > Do I need to reindex all cores after this? Or is it only a query time
> > setting?
> >
> >
> > Thanks
> > Nawab
>
>


Re: Reusable tokenstream

2017-11-22 Thread Emir Arnautović
Hi Roxana,
The idea with the update request processor is to have the following parameters:
* inputField - document field with text to analyse
* sharedAnalysis - field type with shared analysis definition
* targetFields - comma-separated list of fields where the results should be stored.
* fieldSpecificAnalysis - comma-separated list of field types that define the 
specifics for each field (when reusing the schema there will be an extra tokenizer 
that should be ignored)

Your update processor uses TeeSinkTokenFilter to create tokens for each field, 
but you do not write those tokens to the index. You add new fields to the 
document where each token is a new value (or you can concatenate them and use a 
whitespace tokenizer in the indexing analysis chain of the target field). You 
can then remove inputField from the document.
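
As a rough sketch of that idea (the class, token type names and target field
names below are illustrative, not an existing component; the factory wiring is
omitted, and tokens are simply routed by type here rather than via
TeeSinkTokenFilter sinks), the processor could analyse the input field once
with the shared field type's analyzer and fan the tokens out to the target
fields:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class SharedAnalysisProcessor extends UpdateRequestProcessor {
  private final String inputField;        // e.g. the source text field
  private final Analyzer sharedAnalyzer;  // index analyzer of the shared field type

  public SharedAnalysisProcessor(String inputField, Analyzer sharedAnalyzer,
                                 UpdateRequestProcessor next) {
    super(next);
    this.inputField = inputField;
    this.sharedAnalyzer = sharedAnalyzer;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object value = doc.getFieldValue(inputField);
    if (value != null) {
      try (TokenStream ts = sharedAnalyzer.tokenStream(inputField, value.toString())) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        TypeAttribute type = ts.addAttribute(TypeAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // Route tokens by their type attribute; the type names depend on
          // whatever the shared analysis chain actually produces.
          if ("VERB".equals(type.type())) {
            doc.addField("verbs", term.toString());
          } else if ("ADJ".equals(type.type())) {
            doc.addField("adjectives", term.toString());
          }
        }
        ts.end();
      }
      doc.removeField(inputField); // or keep/flag it as ignored in the schema
    }
    super.processAdd(cmd);
  }
}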

HTH,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 17:46, Roxana Danger  wrote:
> 
> Hi Emir,
> In this case, I need more control at Lucene level, so I have to use the
> lucene index writer directly. So, I can not use Solr for importing.
> Or, is there anyway I can add a tokenstream to a SolrInputDocument (is
> there any other class exposed by Solr during indexing that I can use for
> this purpose?).
> Am I correct or still missing something?
> Thank you.
> 
> 
> On Wed, Nov 22, 2017 at 11:33 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Roxana,
>> I think you can use https://lucene.apache.org/core/5_4_0/analyzers-common/
>> org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html <
>> https://lucene.apache.org/core/5_4_0/analyzers-common/
>> org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html> like suggested
>> earlier.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 22 Nov 2017, at 11:43, Roxana Danger  wrote:
>>> 
>>> Hi Emir,
>>> Many thanks for your reply.
>>> The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
>>> > apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String,
>>> java.io.Reader)> the way to obtain a previously generated tokenstream? Is
>> it
>>> guaranteed to get access to the token stream and not reconstruct it?
>>> Thanks,
>>> Roxana
>>> 
>>> 
>>> On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>> 
 Hi Roxana,
 I don’t think that it is possible. In some cases (seems like yours is
>> good
 fit) you could create custom update request processor that would do the
 shared analysis (you can have it defined in schema) and after analysis
>> use
 those tokens to create new values for those two fields and remove source
 value (or flag it as ignored in schema).
 
 HTH,
 Emir
 --
 Monitoring - Log Management - Alerting - Anomaly Detection
 Solr & Elasticsearch Consulting Support Training - http://sematext.com/
 
 
 
> On 22 Nov 2017, at 11:09, Roxana Danger 
>> wrote:
> 
> Hello all,
> 
> I would like to reuse the tokenstream generated for one field, to
>> create
 a
> new tokenstream (adding a few filters to the available tokenstream),
>> for
> another field without the need of executing again the whole analysis.
> 
> The particular application is:
> - I have field *tokens* that uses an analyzer that generate the tokens
 (and
> maintains the token type attributes)
> - I would like to have another two new fields: *verbs* and
>> *adjectives*.
> These should reuse the tokenstream generated for the field *tokens* and
> filter the verbs and adjectives for the respective fields.
> 
> Is this feasible? How should it be implemented?
> 
> Many thanks.
 
 
>> 
>> 



highlight separator

2017-11-22 Thread David Hastings
I'm on Solr 5.x at the moment, and am trying to get the highlighter to
display complete sentences containing the match.  Setting:

'hl.method' => 'fastVector',
'hl.bs.type' =>'SENTENCE',

hasn't been working.  Is there a way for me to do it in the query
itself?
thanks
-Dave


Re: Reusable tokenstream

2017-11-22 Thread Roxana Danger
Mikhail,
Yes, I've just seen your message...

"Hello, Roxana.

You are probably looking for TeeSinkTokenFilter, but I believe the idea is
cumbersome to implement in Solr.
Also there is a preanalyzed field which can keep tokenstream in external form."

This is the answer I was looking for. Thanks a lot.
Your second suggestion is doable. I will reconstruct the tokenstream with its
attributes as a string field and then parse/analyse this preanalysed field
to separate the elements I am interested in...
Thanks again,
Roxana
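
For anyone following along, a minimal Lucene-level sketch of the
TeeSinkTokenFilter route when writing with IndexWriter directly; the field
names, token type strings and the incoming analysed stream are placeholders,
and a production filter would also adjust position increments:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class TeeSinkSketch {
  /** Keeps only tokens whose type attribute matches the given type. */
  static final class TypeKeepFilter extends TokenFilter {
    private final String keepType;
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    TypeKeepFilter(TokenStream in, String keepType) {
      super(in);
      this.keepType = keepType;
    }
    @Override
    public boolean incrementToken() throws IOException {
      while (input.incrementToken()) {
        if (keepType.equals(typeAtt.type())) {
          return true;
        }
      }
      return false;
    }
  }

  static void addDoc(IndexWriter writer, TokenStream analysedTokens) throws IOException {
    // Split one analysis pass into the main stream plus two sinks.
    TeeSinkTokenFilter tee = new TeeSinkTokenFilter(analysedTokens);
    TokenStream verbsSink = tee.newSinkTokenStream();
    TokenStream adjectivesSink = tee.newSinkTokenStream();

    Document doc = new Document();
    // The tee must be consumed before its sinks, so add its field first.
    doc.add(new Field("tokens", tee, TextField.TYPE_NOT_STORED));
    doc.add(new Field("verbs", new TypeKeepFilter(verbsSink, "VERB"), TextField.TYPE_NOT_STORED));
    doc.add(new Field("adjectives", new TypeKeepFilter(adjectivesSink, "ADJ"), TextField.TYPE_NOT_STORED));
    writer.addDocument(doc);
  }
}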



On Wed, Nov 22, 2017 at 11:36 AM, Mikhail Khludnev  wrote:

> Roxana,
> Have you seen my response in "tokenstream reusable" thread?
> reusableTokenStream(java.lang.String
>  Analyzer.html#reusableTokenStream(java.lang.String>,
> doesn't help you. TokenStream is stateless, it holds the attributes for the
> current token only.
> Anyway, it is reset before it's returned for later reuse - it can't carry a
> state.
>
> On Wed, Nov 22, 2017 at 1:43 PM, Roxana Danger 
> wrote:
>
> > Hi Emir,
> > Many thanks for your reply.
> > The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
> >  apache/lucene/analysis/
> > Analyzer.html#reusableTokenStream(java.lang.String,
> > java.io.Reader)> the way to obtain a previously generated tokenstream? Is
> it
> > guaranteed to get access to the token stream and not reconstruct it?
> > Thanks,
> > Roxana
> >
> >
> > On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> > > Hi Roxana,
> > > I don’t think that it is possible. In some cases (seems like yours is
> > good
> > > fit) you could create custom update request processor that would do the
> > > shared analysis (you can have it defined in schema) and after analysis
> > use
> > > those tokens to create new values for those two fields and remove
> source
> > > value (or flag it as ignored in schema).
> > >
> > > HTH,
> > > Emir
> > > --
> > > Monitoring - Log Management - Alerting - Anomaly Detection
> > > Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> > >
> > >
> > >
> > > > On 22 Nov 2017, at 11:09, Roxana Danger 
> > wrote:
> > > >
> > > > Hello all,
> > > >
> > > > I would like to reuse the tokenstream generated for one field, to
> > create
> > > a
> > > > new tokenstream (adding a few filters to the available tokenstream),
> > for
> > > > another field without the need of executing again the whole analysis.
> > > >
> > > > The particular application is:
> > > > - I have field *tokens* that uses an analyzer that generate the
> tokens
> > > (and
> > > > maintains the token type attributes)
> > > > - I would like to have another two new fields: *verbs* and
> > *adjectives*.
> > > > These should reuse the tokenstream generated for the field *tokens*
> and
> > > > filter the verbs and adjectives for the respective fields.
> > > >
> > > > Is this feasible? How should it be implemented?
> > > >
> > > > Many thanks.
> > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Shawn Heisey
On 11/22/2017 6:44 AM, Joe Obernberger wrote:
> Right now, we have a relatively small block cache due to the
> requirements that the servers run other software.  We tried to find
> the best balance between block cache size, and RAM for programs, while
> still giving enough for local FS cache.  This came out to be 84 128M
> blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn



Re: Merging of index in Solr

2017-11-22 Thread Shawn Heisey
On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> I am using the IndexMergeTool from Solr, from the command below:
>
> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> org.apache.lucene.misc.IndexMergeTool
>
> The heap size is 32GB. There are more than 20 million documents in the two
> cores.

I have looked at IndexMergeTool, and confirmed that it does its job in
exactly the same way that Solr does an optimize, so I would still expect
a rate of 20 to 30 MB per second, unless it's running on REALLY old
hardware that can't transfer data that quickly.

Thanks,
Shawn



Re: Trailing wild card searches very slow in Solr

2017-11-22 Thread Shawn Heisey
On 11/20/2017 12:50 PM, Sundeep T wrote:
> I initially asked this question regarding leading wildcards. This was a
> typo, and what I meant was trailing wild card queries were slow. So queries
> like text:"hello*" are slow. We were expecting that since the string field is
> already indexed, the searches would be fast, but that seems not to be the
> case.

The following is my understanding of wildcard queries:

Let's say that the wildcard "hello*" matches one million terms on the
text field you're searching.  I have no idea what the actual number
would be ... one million is arbitrary.

Behind the scenes, that query object will quite literally have one
million different terms in it, and ALL of those terms will be checked
against the index to obtain the list of matching documents.  Even if
each check is pretty fast, doing a million of them to satisfy the query
is going to take some time.  Also, it's going to take some time to
gather the one million matching terms in the first place.  The term
gathering step is particularly slow with leading wildcards, but even
trailing wildcards can result in slow queries.
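
To make the term-expansion step concrete, a small Lucene sketch (index path,
field name and prefix are just examples) that counts how many indexed terms a
trailing wildcard such as hello* would have to visit:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class PrefixTermCount {
  public static void main(String[] args) throws Exception {
    String prefix = "hello";
    try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
         IndexReader reader = DirectoryReader.open(dir)) {
      Terms terms = MultiFields.getTerms(reader, "text");
      long count = 0;
      if (terms != null) {
        TermsEnum te = terms.iterator();
        // Seek to the first term >= "hello", then walk while the prefix matches;
        // a prefix/wildcard query enumerates (and then matches documents for)
        // this same set of terms.
        if (te.seekCeil(new BytesRef(prefix)) != TermsEnum.SeekStatus.END) {
          do {
            if (!te.term().utf8ToString().startsWith(prefix)) {
              break;
            }
            count++;
          } while (te.next() != null);
        }
      }
      System.out.println("terms matching " + prefix + "*: " + count);
    }
  }
}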

Wildcards that only match a few terms might be quick, but predicting the
number of matching terms BEFORE the query is executed is difficult.

Thanks,
Shawn



Re: How to get a solr core to persist

2017-11-22 Thread Shawn Heisey
On 11/20/2017 6:26 AM, Amanda Shuman wrote:
> I did as you suggested and created the core by hand - I copied the files
> from the existing core, including the index files (data directory) and
> changed the core.properties file to the new core name (core_new) and
> restarted. Now I'm having a different issue - it says it is Optimized but
> that Current is not (the console shows the red prohibited sign, which I
> guess means false or something?). So basically there's no content at all in
> there. Re-reading your instructions here: " If you want to relocate the
> data, you can add a dataDir property to core.properties.  If it has a
> relative path, it is relative to the core.properties location." - Did I
> miss a step to get the existing index to load?

If data/index is in your core's directory and actually contains a
complete index, then Solr will load that index for that core on startup.

If the way you handle commits is insufficient, then there is a
possibility that the most recent updates to the source index are sitting
in memory and haven't been written to the on-disk index, but it seems
unlikely that this would result in a completely empty index.

Thanks,
Shawn



Re: NullPointerException in PeerSync.handleUpdates

2017-11-22 Thread Pushkar Raste
As mentioned in the JIRA, the exception seems to be coming from a log
statement. The issue was fixed in 6.3; here is the relevant line from 6.3:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.3.0/solr/core/src/java/org/apache/solr/update/PeerSync.java#L707



On Wed, Nov 22, 2017 at 1:18 AM, Erick Erickson 
wrote:

> Right, if there's no "fixed version" mentioned and if the resolution
> is "unresolved", it's not in the code base at all. But that JIRA is
> not apparently reproducible, especially on more recent versions than
> 6.2. Is it possible to test a more recent version (6.6.2 would be my
> recommendation)?
>
> Erick
>
> On Tue, Nov 21, 2017 at 9:58 PM, S G  wrote:
> > My bad. I found it at https://issues.apache.org/jira/browse/SOLR-9453
> > But I could not find it in changes.txt, perhaps because it's not yet
> resolved.
> >
> > On Tue, Nov 21, 2017 at 9:15 AM, Erick Erickson  >
> > wrote:
> >
> >> Did you check the JIRA list? Or CHANGES.txt in more recent versions?
> >>
> >> On Tue, Nov 21, 2017 at 1:13 AM, S G  wrote:
> >> > Hi,
> >> >
> >> > We are running 6.2 version of Solr and hitting this error frequently.
> >> >
> >> > Error while trying to recover. core=my_core:java.lang.
> >> NullPointerException
> >> > at org.apache.solr.update.PeerSync.handleUpdates(
> >> PeerSync.java:605)
> >> > at org.apache.solr.update.PeerSync.handleResponse(
> >> PeerSync.java:344)
> >> > at org.apache.solr.update.PeerSync.sync(PeerSync.java:257)
> >> > at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
> >> RecoveryStrategy.java:376)
> >> > at org.apache.solr.cloud.RecoveryStrategy.run(
> >> RecoveryStrategy.java:221)
> >> > at java.util.concurrent.Executors$RunnableAdapter.
> >> call(Executors.java:511)
> >> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> > at org.apache.solr.common.util.ExecutorUtil$
> >> MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1142)
> >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:617)
> >> > at java.lang.Thread.run(Thread.java:745)
> >> >
> >> >
> >> >
> >> > Is this a known issue and fixed in some newer version?
> >> >
> >> >
> >> > Thanks
> >> > SG
> >>
>


Re: Reusable tokenstream

2017-11-22 Thread Roxana Danger
Hi Emir,
In this case, I need more control at the Lucene level, so I have to use the
Lucene index writer directly. So I cannot use Solr for importing.
Or is there any way I can add a tokenstream to a SolrInputDocument (is
there any other class exposed by Solr during indexing that I can use for
this purpose)?
Am I correct, or am I still missing something?
Thank you.


On Wed, Nov 22, 2017 at 11:33 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Roxana,
> I think you can use https://lucene.apache.org/core/5_4_0/analyzers-common/
> org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html <
> https://lucene.apache.org/core/5_4_0/analyzers-common/
> org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html> like suggested
> earlier.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 22 Nov 2017, at 11:43, Roxana Danger  wrote:
> >
> > Hi Emir,
> > Many thanks for your reply.
> > The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
> >  apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String,
> > java.io.Reader)> the way to obtain a previously generated tokenstream? Is
> it
> > guaranteed to get access to the token stream and not reconstruct it?
> > Thanks,
> > Roxana
> >
> >
> > On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Roxana,
> >> I don’t think that it is possible. In some cases (seems like yours is
> good
> >> fit) you could create custom update request processor that would do the
> >> shared analysis (you can have it defined in schema) and after analysis
> use
> >> those tokens to create new values for those two fields and remove source
> >> value (or flag it as ignored in schema).
> >>
> >> HTH,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 22 Nov 2017, at 11:09, Roxana Danger 
> wrote:
> >>>
> >>> Hello all,
> >>>
> >>> I would like to reuse the tokenstream generated for one field, to
> create
> >> a
> >>> new tokenstream (adding a few filters to the available tokenstream),
> for
> >>> another field without the need of executing again the whole analysis.
> >>>
> >>> The particular application is:
> >>> - I have field *tokens* that uses an analyzer that generate the tokens
> >> (and
> >>> maintains the token type attributes)
> >>> - I would like to have another two new fields: *verbs* and
> *adjectives*.
> >>> These should reuse the tokenstream generated for the field *tokens* and
> >>> filter the verbs and adjectives for the respective fields.
> >>>
> >>> Is this feasible? How should it be implemented?
> >>>
> >>> Many thanks.
> >>
> >>
>
>


Re: Solr7: Very High number of threads on aggregator node

2017-11-22 Thread Nawab Zada Asad Iqbal
Rick

Your suspicion is correct. I mostly reused my config from Solr 4, except
where it was deprecated or obsolete, in which case I switched to the newer
configs. Having said that, I couldn't find any new query-related settings
which could impact us, since most of our queries don't use fancy new features.

I couldn't find a decent way to copy long XML here, so I created this
Stack Overflow thread:

https://stackoverflow.com/questions/47439503/solr-7-0-1-aggregator-node-spinning-many-threads


Thanks!
Nawab


On Mon, Nov 20, 2017 at 3:10 PM, Rick Leir  wrote:

> Nawab
> Why it would be good to share the solrconfigs: I had a suspicion that you
> might be using the same solrconfig for version 7 and 4.5. That is unlikely
> to work well. But I could be way off base.
> Rick
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hendrik,

Thanks for your response.

Regarding "But this seems to greatly depend on how your setup looks like
and what actions you perform." May I know what factors influence this
and what considerations need to be taken into account?

Thanks

On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
wrote:

> We did some testing and the performance was strangely even better with
> HDFS than with the local file system. But this seems to greatly
> depend on how your setup looks like and what actions you perform. We now
> had a pattern with lots of small updates and commits and that seems to be
> quite a bit slower. We are about to do performance testing on that now.
>
> The reason we switched to HDFS was largely connected to us using Docker
> and Marathon/Mesos. With HDFS the data is in a shared file system and
> thus it is possible to move the replica to a different instance on a a
> different host.
>
> regards,
> Hendrik
>
> On 22.11.2017 14:59, Greenhorn Techie wrote:
> > Hi,
> >
> > Good Afternoon!!
> >
> > While the discussion around issues related to "Solr on HDFS" is live, I
> > would like to understand if anyone has done any performance benchmarking
> > for both Solr indexing and search between HDFS vs local file system.
> >
> > Also, from experience, what would the community folks suggest? Solr on
> > local file system or Solr on HDFS? Has anyone done a comparative study of
> > these choices?
> >
> > Thanks
> >
>
>


Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
Hi Emir,

Yes, I am running the merging on a Windows machine.
The hard disk is a SSD disk in NTFS file system.

Regards,
Edwin

On 22 November 2017 at 16:50, Emir Arnautović 
wrote:

> Hi Edwin,
> Quick googling suggests that this is the issue of NTFS related to large
> number of file fragments caused by large number of files in one directory
> of huge files. Are you running this merging on a Windows machine?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 22 Nov 2017, at 02:33, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Hi,
> >
> > I have encountered this error during the merging of the 3.5TB of index.
> > What could be the cause that lead to this?
> >
> > Exception in thread "main" Exception in thread "Lucene Merge Thread #8" java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757 _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1]
> >     at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
> >     at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871)
> >     at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57)
> > Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation
> >     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> >     at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
> >     at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
> >     at sun.nio.ch.IOUtil.write(Unknown Source)
> >     at sun.nio.ch.FileChannelImpl.write(Unknown Source)
> >     at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
> >     at java.nio.channels.Channels.writeFully(Unknown Source)
> >     at java.nio.channels.Channels.access$000(Unknown Source)
> >     at java.nio.channels.Channels$1.write(Unknown Source)
> >     at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
> >     at java.util.zip.CheckedOutputStream.write(Unknown Source)
> >     at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
> >     at java.io.BufferedOutputStream.write(Unknown Source)
> >     at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
> >     at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
> >     at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
> >     at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175)
> >     at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286)
> >     at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156)
> >     at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866)
> >     at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
> >     at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
> >     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
> >     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> >     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> >     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353)
> >     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928)
> >     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
> >     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
> > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The requested operation could not be completed due to a file system limitation
> >     at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
> >     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
> > Caused by: 

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We did some testing and the performance was strangely even better with 
HDFS than with the local file system. But this seems to greatly 
depend on what your setup looks like and what actions you perform. We now 
have a pattern with lots of small updates and commits, and that seems to be 
quite a bit slower. We are about to do performance testing on that now.


The reason we switched to HDFS was largely connected to us using Docker 
and Marathon/Mesos. With HDFS the data is in a shared file system and 
thus it is possible to move the replica to a different instance on a 
different host.


regards,
Hendrik

On 22.11.2017 14:59, Greenhorn Techie wrote:

Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks





Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Kevin Risden
Thanks for the detailed answers Joe. Definitely sounds like you covered
most of the easy HDFS performance items.

Kevin Risden

On Wed, Nov 22, 2017 at 7:44 AM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi Kevin -
> * HDFS is part of Cloudera 5.12.0.
> * Solr is co-located in most cases.  We do have several nodes that run on
> servers that are not data nodes, but most do. Unfortunately, our nodes are
> not the same size.  Some nodes have 8TBytes of disk, while our largest
> nodes are 64TBytes.  This results in a lot of data that needs to go over
> the network.
>
> * Command is:
> /usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m
> -XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem
> -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m
> -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB
> -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=30
> -DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,
> gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.
> querymasters.com:2181/solr6.6.0 -Dsolr.log.dir=/opt/solr6/server/logs
> -Djetty.port=9100 -DSTOP.PORT=8100 -DSTOP.KEY=solrrocks -Dhost=tarvos
> -Duser.timezone=UTC -Djetty.home=/opt/solr6/server
> -Dsolr.solr.home=/opt/solr6/server/solr -Dsolr.install.dir=/opt/solr6
> -Dsolr.clustering.enabled=true -Dsolr.lock.type=hdfs
> -Dsolr.autoSoftCommit.maxTime=12 -Dsolr.autoCommit.maxTime=180
> -Dsolr.solr.home=/etc/solr6 -Djava.library.path=/opt/cloud
> era/parcels/CDH/lib/hadoop/lib/native -Xss256k -Dsolr.log.muteconsole
> -XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100
> /opt/solr6/server/logs -jar start.jar --module=http
>
> * We have enabled short circuit reads.
>
> Right now, we have a relatively small block cache due to the requirements
> that the servers run other software.  We tried to find the best balance
> between block cache size, and RAM for programs, while still giving enough
> for local FS cache.  This came out to be 84 128M blocks - or about 10G for
> the cache per node (45 nodes total).
>
> <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>   <bool name="solr.hdfs.blockcache.enabled">true</bool>
>   <bool name="solr.hdfs.blockcache.global">true</bool>
>   <int name="solr.hdfs.blockcache.slab.count">84</int>
>   <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
>   <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
>   <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
>   <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
>   <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
>   <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
>   <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
>   <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
> </directoryFactory>
>
> Thanks for reviewing!
>
> -Joe
>
>
>
> On 11/22/2017 8:20 AM, Kevin Risden wrote:
>
>> Joe,
>>
>> I have a few questions about your Solr and HDFS setup that could help
>> improve the recovery performance.
>>
>> * Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
>> * Is Solr colocated with HDFS data nodes?
>> * What is the output of "ps aux | grep solr"? (specifically looking for
>> the
>> Java arguments that are being set.)
>>
>> Depending on how Solr on HDFS was setup, there are some potentially simple
>> settings that can help significantly improve performance.
>>
>> 1) Short circuit reads
>>
>> If Solr is colocated with an HDFS datanode, short circuit reads can
>> improve
>> read performance since it skips a network hop if the data is local to that
>> node. This requires HDFS native libraries to be added to Solr.
>>
>> 2) HDFS block cache in Solr
>>
>> Solr without HDFS uses the OS page cache to handle caching data for
>> queries. With HDFS, Solr has a special HDFS block cache which allows for
>> caching HDFS blocks. This significantly helps query performance. There are
>> a few configuration parameters that can help here.
>>
>> Kevin Risden
>>
>> On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp > >
>> wrote:
>>
>> Hi Joe,
>>>
>>> sorry, I have not seen that problem. I would normally not delete a
>>> replica
>>> if the shard is down but only if there is an active shard. Without an
>>> active leader the replica should not be able to recover. I also just had
>>> a
>>> case where all replicas of a shard stayed in down state and restarts
>>> didn't
>>> help. This was however also caused by lock files. Once I cleaned them up
>>> and restarted all Solr instances that had a replica they recovered.
>>>
>>> For the lock files I discovered that the index is not always in the
>>> "index" folder but can also be in an index. folder. There can
>>> be
>>> an "index.properties" file in the "data" directory in HDFS and this
>>> contains the correct index folder name.
>>>
>>> If you are really desperate you could also delete all but one replica so
>>> that the leader election is quite trivial. But this does of course
>>> increase
>>> the risk of finally loosing the data quite a bit. So I would try looking
>>> into the 

Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks


Result grouping performance

2017-11-22 Thread Kempelen , Ákos
Hello,

I am migrating our codebase from Solr 4.7 to 7.0.1 but the performance of 
result grouping seems very poor using the newer Solr.
For example a simple MatchAllDocsQuery takes 5 sec on Solr4.7, and 21 sec on 
Solr7.
I wonder what causes the 4x difference in time? We hoped that newer Solr 
versions would provide better performance...
Using field collapsing would be a solution, but it produces different 
facet counts.
Thanks,
Akos 
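
For reference, a minimal pair of requests illustrating the two approaches being compared here; the group and facet field names are made up:

  q=*:*&group=true&group.field=groupId&group.limit=1&group.ngroups=true&group.facet=true

  q=*:*&fq={!collapse field=groupId}&expand=true&facet=true&facet.field=category

Facet counts can legitimately differ between the two: with grouping, group.facet=true counts matching groups rather than documents, while the collapse post-filter removes the collapsed documents before faceting, so facets only reflect the group heads. Neither necessarily matches plain document-level counts.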




Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger

Hi Kevin -
* HDFS is part of Cloudera 5.12.0.
* Solr is co-located in most cases.  We do have several nodes that run 
on servers that are not data nodes, but most do. Unfortunately, our 
nodes are not the same size.  Some nodes have 8TBytes of disk, while our 
largest nodes are 64TBytes.  This results in a lot of data that needs to 
go over the network.


* Command is:
/usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m 
-XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem 
-XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m 
-XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75 
-XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB 
-XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails 
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime 
-Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation 
-XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=30 
-DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.querymasters.com:2181/solr6.6.0 
-Dsolr.log.dir=/opt/solr6/server/logs -Djetty.port=9100 -DSTOP.PORT=8100 
-DSTOP.KEY=solrrocks -Dhost=tarvos -Duser.timezone=UTC 
-Djetty.home=/opt/solr6/server -Dsolr.solr.home=/opt/solr6/server/solr 
-Dsolr.install.dir=/opt/solr6 -Dsolr.clustering.enabled=true 
-Dsolr.lock.type=hdfs -Dsolr.autoSoftCommit.maxTime=12 
-Dsolr.autoCommit.maxTime=180 -Dsolr.solr.home=/etc/solr6 
-Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native 
-Xss256k -Dsolr.log.muteconsole 
-XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100 
/opt/solr6/server/logs -jar start.jar --module=http


* We have enabled short circuit reads.

Right now, we have a relatively small block cache due to the 
requirements that the servers run other software.  We tried to find the 
best balance between block cache size, and RAM for programs, while still 
giving enough for local FS cache.  This came out to be 84 128M blocks - 
or about 10G for the cache per node (45 nodes total).
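
(A rough check on that sizing: 84 slabs x 128 MB per slab = 10,752 MB, i.e. about 10.5 GB of off-heap cache per node, which has to fit within the -XX:MaxDirectMemorySize=11g shown above since direct memory allocation is enabled.)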



<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.global">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">84</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
    <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
</directoryFactory>

Thanks for reviewing!

-Joe


On 11/22/2017 8:20 AM, Kevin Risden wrote:

Joe,

I have a few questions about your Solr and HDFS setup that could help
improve the recovery performance.

* Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
* Is Solr colocated with HDFS data nodes?
* What is the output of "ps aux | grep solr"? (specifically looking for the
Java arguments that are being set.)

Depending on how Solr on HDFS was setup, there are some potentially simple
settings that can help significantly improve performance.

1) Short circuit reads

If Solr is colocated with an HDFS datanode, short circuit reads can improve
read performance since it skips a network hop if the data is local to that
node. This requires HDFS native libraries to be added to Solr.

2) HDFS block cache in Solr

Solr without HDFS uses the OS page cache to handle caching data for
queries. With HDFS, Solr has a special HDFS block cache which allows for
caching HDFS blocks. This significantly helps query performance. There are
a few configuration parameters that can help here.

Kevin Risden

On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp 
wrote:


Hi Joe,

sorry, I have not seen that problem. I would normally not delete a replica
if the shard is down but only if there is an active shard. Without an
active leader the replica should not be able to recover. I also just had a
case where all replicas of a shard stayed in down state and restarts didn't
help. This was however also caused by lock files. Once I cleaned them up
and restarted all Solr instances that had a replica they recovered.

For the lock files I discovered that the index is not always in the
"index" folder but can also be in an index. folder. There can be
an "index.properties" file in the "data" directory in HDFS and this
contains the correct index folder name.

If you are really desperate you could also delete all but one replica so
that the leader election is quite trivial. But this does of course increase
the risk of finally loosing the data quite a bit. So I would try looking
into the code and figure out what the problem is here and maybe compare the
state in HDFS and ZK with a shard that works.

regards,
Hendrik


On 21.11.2017 23:57, Joe Obernberger wrote:


Hi Hendrick - the shards in question have three replicas.  I tried
restarting each one (one by one) - no luck.  No leader is found. I deleted
one of the replicas and added a new one, and the new one also shows as
'down'.  I also tried the FORCELEADER call, but that had no effect.  I
checked the 

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Kevin Risden
Joe,

I have a few questions about your Solr and HDFS setup that could help
improve the recovery performance.

* Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
* Is Solr colocated with HDFS data nodes?
* What is the output of "ps aux | grep solr"? (specifically looking for the
Java arguments that are being set.)

Depending on how Solr on HDFS was setup, there are some potentially simple
settings that can help significantly improve performance.

1) Short circuit reads

If Solr is colocated with an HDFS datanode, short circuit reads can improve
read performance since it skips a network hop if the data is local to that
node. This requires HDFS native libraries to be added to Solr.

2) HDFS block cache in Solr

Solr without HDFS uses the OS page cache to handle caching data for
queries. With HDFS, Solr has a special HDFS block cache which allows for
caching HDFS blocks. This significantly helps query performance. There are
a few configuration parameters that can help here.
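
For item 1, a minimal sketch of the hdfs-site.xml side of short-circuit reads (the socket path is only an example, and the libhadoop native library also has to be visible to Solr, e.g. via -Djava.library.path):

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hdfs-sockets/dn</value>
</property>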

Kevin Risden

On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp 
wrote:

> Hi Joe,
>
> sorry, I have not seen that problem. I would normally not delete a replica
> if the shard is down but only if there is an active shard. Without an
> active leader the replica should not be able to recover. I also just had a
> case where all replicas of a shard stayed in down state and restarts didn't
> help. This was however also caused by lock files. Once I cleaned them up
> and restarted all Solr instances that had a replica they recovered.
>
> For the lock files I discovered that the index is not always in the
> "index" folder but can also be in an index. folder. There can be
> an "index.properties" file in the "data" directory in HDFS and this
> contains the correct index folder name.
>
> If you are really desperate you could also delete all but one replica so
> that the leader election is quite trivial. But this does of course increase
> the risk of finally loosing the data quite a bit. So I would try looking
> into the code and figure out what the problem is here and maybe compare the
> state in HDFS and ZK with a shard that works.
>
> regards,
> Hendrik
>
>
> On 21.11.2017 23:57, Joe Obernberger wrote:
>
>> Hi Hendrick - the shards in question have three replicas.  I tried
>> restarting each one (one by one) - no luck.  No leader is found. I deleted
>> one of the replicas and added a new one, and the new one also shows as
>> 'down'.  I also tried the FORCELEADER call, but that had no effect.  I
>> checked the OVERSEERSTATUS, but there is nothing unusual there.  I don't
>> see anything useful in the logs except the error:
>>
>> org.apache.solr.common.SolrException: Error getting leader from zk for
>> shard shard21
>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.
>> java:996)
>> at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
>> at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
>> at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(
>> ZkContainer.java:181)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolE
>> xecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1149)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.solr.common.SolrException: Could not get leader
>> props
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkControll
>> er.java:1043)
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkControll
>> er.java:1007)
>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.
>> java:963)
>> ... 7 more
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:111)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:51)
>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkCl
>> ient.java:357)
>> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkCl
>> ient.java:354)
>> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
>> CmdExecutor.java:60)
>> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClie
>> nt.java:354)
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkControll
>> er.java:1021)
>> ... 9 more
>>
>> Can I modify zookeeper to force a leader?  Is there any other way to
>> recover from this?  Thanks very much!
>>
>> -Joe
>>
>>
>> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>
>>> We sometimes also have replicas not recovering. If one replica is left
>>> active the easiest is to then to delete the replica and create a new one.
>>> When 

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Hendrick - I was halting a replica and then restarting it, waited, 
then restarted another one.  That didn't work, but when I halted all 
three, and then restarted those one by one, the shard finally elected a 
leader and came up.  Phew!  I too noticed the lock files in 
index.<timestamp> folders.  Usually what I do is:

hadoop fs -ls -R /solr6.6.0 | grep write.lock > out.txt
then
cat out.txt | cut --bytes 57-
to get a list of files to delete

Glad these shards have come up!  Thanks very much.

-Joe


On 11/22/2017 5:20 AM, Hendrik Haddorp wrote:

Hi Joe,

sorry, I have not seen that problem. I would normally not delete a 
replica if the shard is down but only if there is an active shard. 
Without an active leader the replica should not be able to recover. I 
also just had a case where all replicas of a shard stayed in down 
state and restarts didn't help. This was however also caused by lock 
files. Once I cleaned them up and restarted all Solr instances that 
had a replica they recovered.


For the lock files I discovered that the index is not always in the 
"index" folder but can also be in an index. folder. There 
can be an "index.properties" file in the "data" directory in HDFS and 
this contains the correct index folder name.


If you are really desperate you could also delete all but one replica 
so that the leader election is quite trivial. But this does of course 
increase the risk of finally loosing the data quite a bit. So I would 
try looking into the code and figure out what the problem is here and 
maybe compare the state in HDFS and ZK with a shard that works.


regards,
Hendrik

On 21.11.2017 23:57, Joe Obernberger wrote:
Hi Hendrick - the shards in question have three replicas.  I tried 
restarting each one (one by one) - no luck.  No leader is found. I 
deleted one of the replicas and added a new one, and the new one also 
shows as 'down'.  I also tried the FORCELEADER call, but that had no 
effect.  I checked the OVERSEERSTATUS, but there is nothing unusual 
there.  I don't see anything useful in the logs except the error:


org.apache.solr.common.SolrException: Error getting leader from zk 
for shard shard21
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
    at 
org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at 
org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader 
props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)

    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

    ... 9 more

Can I modify zookeeper to force a leader?  Is there any other way to 
recover from this?  Thanks very much!


-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
We sometimes also have replicas not recovering. If one replica is 
left active the easiest is to then to delete the replica and create 
a new one. When all replicas are down it helps most of the time to 
restart one of the nodes that contains a replica in down state. If 
that also doesn't get the replica to recover I would check the logs 
of the node and also that of the overseer node. I have seen the same 
issue on Solr using local storage. The main HDFS related issues we 
had so far was those lock files and if you delete and recreate 
collections/cores and it sometimes happens that the data was not 
cleaned up in HDFS and then causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have 
no comparison.  What we've been doing is keeping two main 
collections - all data, and the last 30 days of data.  Then 

Re: Reusable tokenstream

2017-11-22 Thread Mikhail Khludnev
Roxana,
Have you seen my response in "tokenstream reusable" thread?
reusableTokenStream(java.lang.String, java.io.Reader)
doesn't help you. TokenStream is stateless; it holds the attributes for the
current token only.
Anyway, it is reset before it's returned for later reuse - it can't carry
state.
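
In other words, any per-token information has to be captured by the consumer while the stream is being consumed, for example via AttributeSource.captureState(). A minimal sketch (nothing Solr-specific, just the Lucene consumption loop):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// consume the stream once and keep a snapshot of every token's attributes
static List<AttributeSource.State> captureStates(TokenStream ts) throws IOException {
  List<AttributeSource.State> states = new ArrayList<>();
  ts.reset();
  while (ts.incrementToken()) {
    states.add(ts.captureState());  // copy of the current token's attribute values
  }
  ts.end();
  return states;
}
// a later consumer with the same attribute definitions can replay these tokens
// via restoreState(...), which is essentially what TeeSinkTokenFilter does internally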

On Wed, Nov 22, 2017 at 1:43 PM, Roxana Danger 
wrote:

> Hi Emir,
> Many thanks for your reply.
> The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
> the way to obtain a previously generated tokenstream? Is it
> guaranteed to get access to the token stream and not reconstruct it?
> Thanks,
> Roxana
>
>
> On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
> > Hi Roxana,
> > I don’t think that it is possible. In some cases (seems like yours is
> good
> > fit) you could create custom update request processor that would do the
> > shared analysis (you can have it defined in schema) and after analysis
> use
> > those tokens to create new values for those two fields and remove source
> > value (or flag it as ignored in schema).
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 22 Nov 2017, at 11:09, Roxana Danger 
> wrote:
> > >
> > > Hello all,
> > >
> > > I would like to reuse the tokenstream generated for one field, to
> create
> > a
> > > new tokenstream (adding a few filters to the available tokenstream),
> for
> > > another field without the need of executing again the whole analysis.
> > >
> > > The particular application is:
> > > - I have field *tokens* that uses an analyzer that generate the tokens
> > (and
> > > maintains the token type attributes)
> > > - I would like to have another two new fields: *verbs* and
> *adjectives*.
> > > These should reuse the tokenstream generated for the field *tokens* and
> > > filter the verbs and adjectives for the respective fields.
> > >
> > > Is this feasible? How should it be implemented?
> > >
> > > Many thanks.
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Reusable tokenstream

2017-11-22 Thread Emir Arnautović
Hi Roxana,
I think you can use TeeSinkTokenFilter
(https://lucene.apache.org/core/5_4_0/analyzers-common/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html)
as suggested earlier.
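
At the Lucene level the tee/sink wiring looks roughly like this. A sketch only: the "VERB"/"ADJ" type values are made up, and exposing the sinks as separate Solr fields still needs a custom analyzer or update processor around it:

import java.util.Collections;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;

// 'source' is the already built analysis chain of the shared "tokens" field
static TokenStream[] teeByType(TokenStream source) {
  TeeSinkTokenFilter tee = new TeeSinkTokenFilter(source);
  // each sink replays the tokens that flow through the tee, so the heavy
  // analysis runs only once; TypeTokenFilter then keeps a single token type
  TokenStream verbs = new TypeTokenFilter(tee.newSinkTokenStream(),
      Collections.singleton("VERB"), true);
  TokenStream adjectives = new TypeTokenFilter(tee.newSinkTokenStream(),
      Collections.singleton("ADJ"), true);
  // the tee itself must be fully consumed (e.g. via tee.consumeAllTokens())
  // before the sinks are read
  return new TokenStream[] { tee, verbs, adjectives };
}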

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 11:43, Roxana Danger  wrote:
> 
> Hi Emir,
> Many thanks for your reply.
> The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
> the way to obtain a previously generated tokenstream? Is it
> guaranteed to get access to the token stream and not reconstruct it?
> Thanks,
> Roxana
> 
> 
> On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Roxana,
>> I don’t think that it is possible. In some cases (seems like yours is good
>> fit) you could create custom update request processor that would do the
>> shared analysis (you can have it defined in schema) and after analysis use
>> those tokens to create new values for those two fields and remove source
>> value (or flag it as ignored in schema).
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 22 Nov 2017, at 11:09, Roxana Danger  wrote:
>>> 
>>> Hello all,
>>> 
>>> I would like to reuse the tokenstream generated for one field, to create
>> a
>>> new tokenstream (adding a few filters to the available tokenstream), for
>>> another field without the need of executing again the whole analysis.
>>> 
>>> The particular application is:
>>> - I have field *tokens* that uses an analyzer that generate the tokens
>> (and
>>> maintains the token type attributes)
>>> - I would like to have another two new fields: *verbs* and *adjectives*.
>>> These should reuse the tokenstream generated for the field *tokens* and
>>> filter the verbs and adjectives for the respective fields.
>>> 
>>> Is this feasible? How should it be implemented?
>>> 
>>> Many thanks.
>> 
>> 



Re: Reusable tokenstream

2017-11-22 Thread Roxana Danger
Hi Emir,
Many thanks for your reply.
The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
the way to obtain a previously generated tokenstream? Is it
guaranteed to get access to the token stream and not reconstruct it?
Thanks,
Roxana


On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Roxana,
> I don’t think that it is possible. In some cases (seems like yours is good
> fit) you could create custom update request processor that would do the
> shared analysis (you can have it defined in schema) and after analysis use
> those tokens to create new values for those two fields and remove source
> value (or flag it as ignored in schema).
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 22 Nov 2017, at 11:09, Roxana Danger  wrote:
> >
> > Hello all,
> >
> > I would like to reuse the tokenstream generated for one field, to create
> a
> > new tokenstream (adding a few filters to the available tokenstream), for
> > another field without the need of executing again the whole analysis.
> >
> > The particular application is:
> > - I have field *tokens* that uses an analyzer that generate the tokens
> (and
> > maintains the token type attributes)
> > - I would like to have another two new fields: *verbs* and *adjectives*.
> > These should reuse the tokenstream generated for the field *tokens* and
> > filter the verbs and adjectives for the respective fields.
> >
> > Is this feasible? How should it be implemented?
> >
> > Many thanks.
>
>


Re: Reusable tokenstream

2017-11-22 Thread Emir Arnautović
Hi Roxana,
I don’t think that it is possible. In some cases (and yours seems like a good fit) 
you could create a custom update request processor that would do the shared 
analysis (you can have it defined in the schema) and, after the analysis, use those 
tokens to create new values for those two fields and remove the source value (or 
flag it as ignored in the schema).
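
For illustration, a bare-bones sketch of such a processor. The field names (tokens, verbs, adjectives) and the "VERB"/"ADJ" type values come from the question and are assumptions, and the factory would still have to be registered in an updateRequestProcessorChain in solrconfig.xml:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SharedAnalysisProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object value = doc.getFieldValue("tokens");
        if (value != null) {
          // run the analysis chain defined for the "tokens" field exactly once
          Analyzer analyzer = req.getSchema().getFieldType("tokens").getIndexAnalyzer();
          try (TokenStream ts = analyzer.tokenStream("tokens", value.toString())) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
              // route each token to the extra fields based on its type attribute
              if ("VERB".equals(type.type())) {
                doc.addField("verbs", term.toString());
              } else if ("ADJ".equals(type.type())) {
                doc.addField("adjectives", term.toString());
              }
            }
            ts.end();
          }
        }
        super.processAdd(cmd);
      }
    };
  }
}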

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 11:09, Roxana Danger  wrote:
> 
> Hello all,
> 
> I would like to reuse the tokenstream generated for one field, to create a
> new tokenstream (adding a few filters to the available tokenstream), for
> another field without the need of executing again the whole analysis.
> 
> The particular application is:
> - I have field *tokens* that uses an analyzer that generate the tokens (and
> maintains the token type attributes)
> - I would like to have another two new fields: *verbs* and *adjectives*.
> These should reuse the tokenstream generated for the field *tokens* and
> filter the verbs and adjectives for the respective fields.
> 
> Is this feasible? How should it be implemented?
> 
> Many thanks.



Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Hendrik Haddorp

Hi Joe,

sorry, I have not seen that problem. I would normally not delete a 
replica if the shard is down but only if there is an active shard. 
Without an active leader the replica should not be able to recover. I 
also just had a case where all replicas of a shard stayed in down state 
and restarts didn't help. This was however also caused by lock files. 
Once I cleaned them up and restarted all Solr instances that had a 
replica they recovered.


For the lock files I discovered that the index is not always in the 
"index" folder but can also be in an index. folder. There can 
be an "index.properties" file in the "data" directory in HDFS and this 
contains the correct index folder name.


If you are really desperate you could also delete all but one replica so 
that the leader election is quite trivial. But this does of course 
increase the risk of finally losing the data quite a bit. So I would 
try looking into the code and figure out what the problem is here and 
maybe compare the state in HDFS and ZK with a shard that works.


regards,
Hendrik

On 21.11.2017 23:57, Joe Obernberger wrote:
Hi Hendrick - the shards in question have three replicas.  I tried 
restarting each one (one by one) - no luck.  No leader is found. I 
deleted one of the replicas and added a new one, and the new one also 
shows as 'down'.  I also tried the FORCELEADER call, but that had no 
effect.  I checked the OVERSEERSTATUS, but there is nothing unusual 
there.  I don't see anything useful in the logs except the error:


org.apache.solr.common.SolrException: Error getting leader from zk for 
shard shard21
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)

    at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader 
props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)

    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

    ... 9 more

Can I modify zookeeper to force a leader?  Is there any other way to 
recover from this?  Thanks very much!


-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
We sometimes also have replicas not recovering. If one replica is 
left active the easiest is to then to delete the replica and create a 
new one. When all replicas are down it helps most of the time to 
restart one of the nodes that contains a replica in down state. If 
that also doesn't get the replica to recover I would check the logs 
of the node and also that of the overseer node. I have seen the same 
issue on Solr using local storage. The main HDFS related issues we 
had so far was those lock files and if you delete and recreate 
collections/cores and it sometimes happens that the data was not 
cleaned up in HDFS and then causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have 
no comparison.  What we've been doing is keeping two main 
collections - all data, and the last 30 days of data.  Then we 
handle queries based on date range. The 30 day index is 
significantly faster.


My main concern right now is that 6 of the 100 shards are not coming 
back because of no leader.  I've never seen this error before.  Any 
ideas?  ClusterStatus shows all three replicas with state 'down'.


Thanks!

-joe


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
We actually also have some performance issue with HDFS at the 
moment. We are doing lots of soft commits for NRT search. Those 
seem to be slower then with local storage. The investigation is 

Reusable tokenstream

2017-11-22 Thread Roxana Danger
Hello all,

I would like to reuse the tokenstream generated for one field, to create a
new tokenstream (adding a few filters to the available tokenstream), for
another field without the need of executing again the whole analysis.

The particular application is:
- I have a field *tokens* that uses an analyzer that generates the tokens (and
maintains the token type attributes)
- I would like to have another two new fields: *verbs* and *adjectives*.
These should reuse the tokenstream generated for the field *tokens* and
filter the verbs and adjectives for the respective fields.

Is this feasible? How should it be implemented?

Many thanks.


Re: LTR training

2017-11-22 Thread ilayaraja
Thanks Diego for the pointers..will check.



-
--Ilay
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Merging of index in Solr

2017-11-22 Thread Emir Arnautović
Hi Edwin,
Quick googling suggests that this is an NTFS issue related to a large number 
of file fragments, caused by a large number of files in one directory or by huge 
files. Are you running this merging on a Windows machine?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 02:33, Zheng Lin Edwin Yeo  wrote:
> 
> Hi,
> 
> I have encountered this error during the merging of the 3.5TB of index.
> What could be the cause that lead to this?
> 
> Exception in thread "main" Exception in thread "Lucene Merge Thread #8" java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757 _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1]
>     at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
>     at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871)
>     at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57)
> Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>     at sun.nio.ch.IOUtil.write(Unknown Source)
>     at sun.nio.ch.FileChannelImpl.write(Unknown Source)
>     at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
>     at java.nio.channels.Channels.writeFully(Unknown Source)
>     at java.nio.channels.Channels.access$000(Unknown Source)
>     at java.nio.channels.Channels$1.write(Unknown Source)
>     at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
>     at java.util.zip.CheckedOutputStream.write(Unknown Source)
>     at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>     at java.io.BufferedOutputStream.write(Unknown Source)
>     at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
>     at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
>     at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
>     at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175)
>     at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286)
>     at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
>     at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
>     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928)
>     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The requested operation could not be completed due to a file system limitation
>     at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
> Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>     at sun.nio.ch.IOUtil.write(Unknown Source)
>     at sun.nio.ch.FileChannelImpl.write(Unknown Source)
>     at java.nio.channels.Channels.writeFullyImpl(Unknown