it of this (3*4GB * 2 = 24GB):
>
> http://www.kingston.com/datasheets/KHX1600C9S3K2_8GX.pdf
>
> On Wed, Jan 11, 2012 at 9:33 AM, Dawid Weiss wrote:
>
>> This is a fairly old VM you're running with, but if you get the same
>> error with JRockit then I would assume it
as the rest of your suggestions and post back the results.
>
> Thanks.
>
>
> On Wed, Jan 11, 2012 at 9:56 AM, Dawid Weiss wrote:
>
>> The dump you're getting indicates a SIGSEGV in a garbage collection.
>> This isn't unlikely (there are bugs in there as well
>
>
> On Wed, Jan 11, 2012 at 10:06 AM, Dawid Weiss wrote:
>>
>> Oops, yes, sorry -- I only quickly looked at the invocation line on
>> stack overflow and overlooked it. -Xms4g shouldn't make any
>> difference.
>>
>> Dawid
>>
>> On We
> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
> BasicAutomata.makeString(String) on each of your expanded terms, then
> BasicOperations.union on all of those automata to make a single
How many input strings do you have? The API Mike mentioned is from a
port of the Brics li
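A rough sketch of the union approach Mike describes above, assuming the Lucene 3.x/4.x automaton API (BasicAutomata/BasicOperations were later folded into Automata/Operations); the expanded terms are illustrative only:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.BasicAutomata;
import org.apache.lucene.util.automaton.BasicOperations;

// Build one string automaton per expanded term, then union them all.
List<Automaton> perTerm = new ArrayList<>();
for (String term : new String[] {"color", "colour", "colors"}) {
  perTerm.add(BasicAutomata.makeString(term));
}
Automaton union = BasicOperations.union(perTerm);
// Depending on the version you may also want to determinize/minimize the
// result (see BasicOperations and MinimizationOperations).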
I filed an issue for that.
https://issues.apache.org/jira/browse/LUCENE-3832
I'll try to port it myself actually. It shouldn't be a big problem.
Dawid
On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless
wrote:
> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
> the insert/
The issue has a patch -- feel free to try it out.
Dawid
On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss wrote:
> I filed an issue for that.
> https://issues.apache.org/jira/browse/LUCENE-3832
>
> I'll try to port it myself actually. It shouldn't be a big problem.
>
> Da
> Wow, that was quick! Thanks!
The power of open source and coffee break, combined...
> I don't think we'll have too many terms per query term - as I said earlier,
> we're restricting the expansions to those with an edit distance of 1. But
> this looks cool anyway.
Shouldn't make much of a d
> Rum is an essential ingredient in all software systems :-)
You probably meant "social systems".
D.
> Why anyone buys computers without SSDs is a mystery to me. Use SSDs for
On topic and highly recommended:
http://www.youtube.com/watch?v=H7PJ1oeEyGg
Dawid
Read this:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
Dawid
On Thu, Jul 19, 2012 at 1:32 PM, Dragon Fly wrote:
>
> The slowest part of my application is to read the search hits from disk. I
> was hoping that using an SSD or RAMDirectory/MMapDirectory would speed th
http://static1.blip.pl/user_generated/update_pictures/1758685.jpg
On Thu, Aug 2, 2012 at 8:32 AM, roz dev wrote:
> wow!! That was quick.
>
> Thanks a ton.
>
>
> On Wed, Aug 1, 2012 at 11:07 PM, Simon Willnauer
> wrote:
>
>> On Thu, Aug 2, 2012 at 7:53 AM, roz dev wrote:
>> > Thanks Robert for th
What you need is a suffix tree or a suffix array. Both data structures
will allow you to perform constant-time searches for existence/
occurrence of any input pattern. Depending on how much text you have
on the input it may either be a simple task -- see here:
http://labs.carrotsearch.com/jsuffixa
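For illustration only (not code from this thread), a naive suffix-array sketch in Java; construction here is quadratic, so a real algorithm or library is needed for large inputs:

import java.util.Arrays;

String text = "abracadabra";
int n = text.length();
Integer[] sa = new Integer[n];
for (int i = 0; i < n; i++) sa[i] = i;
// Sort suffix start offsets lexicographically by the suffix they denote.
Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));

// Binary search for the first suffix >= pattern, then check for a prefix match.
String pattern = "cad";
int lo = 0, hi = n;
while (lo < hi) {
  int mid = (lo + hi) >>> 1;
  if (text.substring(sa[mid]).compareTo(pattern) < 0) lo = mid + 1;
  else hi = mid;
}
boolean found = lo < n && text.startsWith(pattern, sa[lo]);
System.out.println(found); // true: "cad" occurs in "abracadabra"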
> Does Lucene support this type of structure, or do I need to somehow implement
> it outside Lucene?
You'd have to implement it separately but it'd be much, much smaller
than Lucene itself (even obfuscated).
> By the way, I need this to run on an Android phone so size of memory might be
> an is
> The WhitespaceAnalyzer breaks up text by spaces and tabs and newlines.
> After that, you can use wildcards. This will use very little space. I
> believe leading & trailing wildcards are supported now, right?
If leading wildcards take too much time (don't know, really) then one
could also try to index
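The message is cut off above; one common variant of that idea (an assumption, not necessarily what was meant here) is to index each term a second time in reversed form, so that a leading wildcard becomes a cheap prefix query on the reversed field. A sketch with made-up field names:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

Document doc = new Document();
String token = "example";
String reversed = new StringBuilder(token).reverse().toString(); // "elpmaxe"
doc.add(new StringField("body", token, Field.Store.NO));
doc.add(new StringField("body_reversed", reversed, Field.Store.NO));

// Query time: the leading wildcard "*ple" becomes the prefix "elp" on the
// reversed field.
Query q = new PrefixQuery(new Term("body_reversed", "elp"));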
> https://issues.apache.org/jira/browse/LUCENE-4491 ? Could you simply
> stuff your ISBN onto the end of the suggestion (ie enroll Lucene in
> Action|1933988177)?
Just remember that if your suffixes are unique then you'll be
expanding the automaton quite a bit (unique suffix paths).
D.
Iterating character-by-character is different from considering the
entire string at once, so your observation is correct; that's how it's
supposed to work. In particular, note this in the String#toLowerCase
documentation:
"Since case mappings are not always 1:1 char mappings, the resulting
String may b
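A small illustration of that (not from the thread): U+0130, the Turkish dotted capital I, lowercases to two chars in the root locale, so the result is longer than the input.

import java.util.Locale;

String s = "\u0130";                        // "İ", length 1
String lower = s.toLowerCase(Locale.ROOT);  // "i" + combining dot above
System.out.println(s.length() + " -> " + lower.length()); // 1 -> 2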
> Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and
> then have a separate search index with "boston red sox rumor" indexed
> as a document. If the user types "red so", then you run suggest on
> "red" and on "so", and then run a hmm MultiPhraseQuery for
> (red|redmond|reddit)
Jerome,
Some of the tokens are removed because their part of speech tags are
in the stoptags file? That's my guess at least -- you can always try
to copy/paste Japanese analyzer and change the token stream
components:
protected TokenStreamComponents createComponents(String fieldName,
Reader reader)
Here's another thought: if you desperately need complex searches then
you could do a heuristic filtering to narrow down the search: use an
analyzer that does some form of input splitting into terms (removing
excess whitespace or even producing n-grams from the input), then do
the same for the query
Start Tomcat with class loading info and inspect the logs for multiple
Lucene JARs (even though the version seems to be fine)?
export CATALINA_OPTS=-XX:+TraceClassLoading
$TOMCAT_HOME/bin/catalina run
Dawid
On Wed, Feb 16, 2011 at 10:23 AM, starz10de wrote:
>
> Hi All,
>
> I have an application
Can you shed some more light on what you're trying to achieve (what is
the purpose of clustering -- are clusters to be utilized for front-end
user interface, further data mining analysis, etc.)?
With the sizes you report Carrot2 won't work for you, I'm afraid, but
Mahout may. Still, there's plenty
> 1) We index around 20 fields, of that we want to have grouping option
> for five of them. For ex., user can search on name of the city and we
> should have option to group by products available in that city (and
> vice-versa).
>
Are these fields strictly defined or free text? Because if they are
> that IP etc. These are definitely not dictionary fields.
>
> I'm looking at faceting right now - checking if this would work with
> Lucene (as we can not change to Solr at this point). What's the main
> difference between clustering and faceting?
>
> Thanks,
> -vivek
&
> There are ways to solve the initialization problem. The JVM guarantees
> that an object is consistent after the ctor is run, so you can do the
> initialization like this (please note the double {{}}, which is an inline
> ctor, this is also often seen for unmodifiable HashSets):
>
> final OpenB
> In general a *newly* created object that was not yet seen by any other
> thread is always safe. This is why I said, set all bits in the ctor. This
> is
> easy to understand: Before the ctor returns, the object's contents and all
> references like arrays are not seen by any other thread (that's
>
> static void writer() {
>     f = new FinalFieldExample();
> }
>
> static void reader() {
>     if (f != null) {
>         int i = f.x; // guaranteed to see 3
>         int j = f.y; // could see 0
>     }
> }
> }
>
In this snippet of code there's not even a guarant
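For reference, the complete example the quoted fragment appears to be taken from (essentially the final-field example in JLS §17.5):

class FinalFieldExample {
  final int x;
  int y;
  static FinalFieldExample f;

  public FinalFieldExample() {
    x = 3;
    y = 4;
  }

  static void writer() {
    f = new FinalFieldExample();
  }

  static void reader() {
    if (f != null) {
      int i = f.x; // guaranteed to see 3
      int j = f.y; // could see 0
    }
  }
}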
Don't know if this helps, but debugging stuff like this I simply add a
(manually inserted or aspectj-injected) recursion count, add a breakpoint
inside an if checking for recursion count >> X and run the vm with an
attached socket debugger. This lets you run at (nearly) full speed and once
you hit
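A rough illustration of that trick (all names are made up): keep a per-thread recursion depth and give the debugger a branch to break on only once the depth crosses a threshold, so the VM runs at near full speed until then.

private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

void suspectRecursiveMethod() {
  int depth = DEPTH.get() + 1;
  DEPTH.set(depth);
  try {
    if (depth > 500) {
      int breakHere = depth; // set the breakpoint on this line
    }
    // ... original method body, possibly recursing ...
  } finally {
    DEPTH.set(depth - 1);
  }
}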
> lucene/Search that is taking the time, I also had another attempt using
> luke
> > but find it incredibly buggy and of little use
>
Can you expand on this too? What kind of "incredible bugs" did you see?
Without feedback there is little progress, so bug reports count.
Dawid
:17, Dawid Weiss wrote:
>
>
>
> > lucene/Search that is taking the time, I also had another attempt using
>> luke
>> > but find it incredibly buggy and of little use
>>
>
> Can you expand on this too? What kind of "incredible bugs" did you see?
>
This is the e-mail you're looking for, Steven (it wasn't forwarded to the
list, apparently).
Dawid
-- Forwarded message --
From: Paul Taylor
Date: Fri, Apr 29, 2011 at 10:11 PM
Subject: Re: Lucene 3.0.3 with debug information
To: Dawid Weiss
On 29/04/2011 15:17, D
If you check out the source code of solr/lucene, look at FSTLookup class and
FSTLookupTest -- you can populate FSTLookup manually with terms/ phrases
from your index and then use the resulting automaton for suggestions.
Dawid
On Thu, May 5, 2011 at 2:54 PM, Clemens Wyss wrote:
> I have implemen
p "sits" behind this suggester
> http://search-lucene.com/m/586gA4ccL11
> here?
>
>
> > -Ursprüngliche Nachricht-
> > Von: Dawid Weiss [mailto:dawid.we...@gmail.com]
> > Gesendet: Donnerstag, 5. Mai 2011 15:00
> > An: java-user@lucene.apache.org
uggested terms
> untouched, i.e. cased.
>
> Clemens
>
> > -Ursprüngliche Nachricht-
> > Von: Dawid Weiss [mailto:dawid.we...@gmail.com]
> > Gesendet: Freitag, 6. Mai 2011 11:12
> > An: java-user@lucene.apache.org
> > Betreff: Re: Using Solr's
to add "fuzzy" lookup for terms?
>
> E.g.:
> "melo" should also bring up "merlot"
>
> > -Ursprüngliche Nachricht-
> > Von: Dawid Weiss [mailto:dawid.we...@gmail.com]
> > Gesendet: Freitag, 6. Mai 2011 11:30
> > An: java-user@
First of all I should probably congratulate my fellow Germans -- Dirk
Nowitzki's outstanding performance during this year's NBA finals will
become part of the history of basketball. As a Pole, I admit I'm
really freaking jealous.
Now... back to the subject.
A number of people have expressed an in
Can you describe the kind of sorting you're doing? Maybe the data is
already sorted (and in RAM) and you're only getting it out?
Dawid
On Fri, Jun 24, 2011 at 3:32 AM, Denis Bazhenov wrote:
> Well, maybe it's a bit controversial question, but anyway...
>
> Lucene is a great toolkit for search ap
ting by field value. We have around 1M documents
> which we are searching and returns them to the user in reverse order by
> creation date. Creation date is indexed in separated field in lucene of
> course.
>
> On Jun 24, 2011, at 4:52 PM, Dawid Weiss wrote:
>
>> Can yo
Elmer: a TST will have a large overhead. An FST may not be that much better if
your input has very few shared prefixes or suffixes. In your case I think this is
unfortunately true. What I would do is create a regular Lucene index and
store it on disk. Then run prefix queries on it. Should work and scale to
l
tion today but maybe we
> should? Suffix sharing requires sizable RAM while building because it
> maintains a hash containing all nodes in order to locate the dups.
>
> It's also possible to improve FST to have shades of gray between
> on/off... I'll open an issue.
>
&g
while ago, but I've been swamped with
other work, sorry.
Dawid
On Thu, Jul 7, 2011 at 7:16 PM, Michael McCandless
wrote:
> On Thu, Jul 7, 2011 at 7:00 AM, Dawid Weiss wrote:
>> Another option to trade off size and memory is to do an LRU-like cache of suffix
>> nodes/ registry. Im s
This one is humorous (watch for foul language though). It does get to the
point, however, and Bergman is a clever guy:
http://www.livestream.com/oreillyconfs/video?clipId=pla_3beec3a2-54f5-4a19-8aaf-35a839b6ecaa
Dawid
On Tue, Aug 23, 2011 at 10:00 AM, Toke Eskildsen
wrote:
> On Mon, 2011-08-22
>
>
> We installed SSDs in all developer machines in 2009 (Intel X25) and
> haven't looked back.
>
>
I can confirm this from my own experience. Once you have a (fast) SSD on
your development machine you are not likely to go back to a spinning
drive...
Dawid
Also, run memtest on your machine to rule out memory corruption; this
unfortunately may cause effects like the one you're describing.
Dawid
On Thu, Sep 1, 2011 at 11:21 AM, Federico Fissore wrote:
> Dragan Jotanovic, il 01/09/2011 11:12, ha scritto:
>>
>> Hi,
>> I recently upgraded to lucene 3.3
Hi Grant,
In Carrot2 (and Carrot Search's commercial products) we're not using
Lucene as an indexing/ search service directly, but we are re-using a
lot of internal infrastructure (like analyzers, ported snowball
stemmers and other segmentation stuff). We also plan on using the new
language identi
> What can possibly cause this exception? I can't be calling the constructor of
> IndexWriter twice, can I ;)
I bet Chuck Norris can do that! :)
Dawid
Avg lookup time slightly less than a HashSet? Interesting. Is the code
to these benchmarks available somewhere?
Dawid
On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote:
>
> On Oct 25, 2011, at 11:26 AM, mark harwood wrote:
>
using Lucene that don't fit under the core premise of full te
> Lucene started out at an avg 3ms but subsequent runs took it down
> dramatically due to OS file caching. The all-in-memory hashset implementation
> clearly did not demonstrate the same speed ups between runs.
I'm not saying the benchmark was wrong or anything, but this is
surprising. I mean, the
m also using public domain Wikipedia data so can release the code and data
> somewhere if that's of interest.
>
> Cheers
> Mark
>
>
>
> - Original Message -
> From: Dawid Weiss
> To: java-user@lucene.apache.org
> Cc:
> Sent: Tuesday, 25 October 2011,
I am currently working on a refactoring of FSTLookup so that either one or
both of your objectives will be possible.
I would still argue that storing exact scores does not make much sense
(think: if you collect query logs then you probably won't differentiate
between two suggestions that differ by
Hi Michal,
Pretty cool. Your work reminds me of what Leo Galambos did a while back:
http://link.springer.com/chapter/10.1007/978-3-540-39985-8_22
I believe his implementation is still available in the Egothor search
engine project.
Dawid
On Wed, Oct 23, 2013 at 5:17 PM, Michal Hlavac wrote:
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's a useful to-have for other NLP tasks). Do
you think it'd be possible (read: relatively easy) to create an
analyzer (or a modificatio
> I spotted Uwe's comment in JIRA the other day "BTRFS, which might also
> bring some cool things for Lucene.".
What cool things about BTRFS are you talking about, Uwe? Just curious.
Dawid
.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Dawid Weiss [mailto:dawid.we...@gmail.com]
>> Sent: Monday, December 22, 2014 8:48 AM
>> To: java-user@lucene.apache.org
>> Cc: Uwe Schindler
>> Subject: Re: BTRFS ?
>&
> This could speed up tests, especially Solr where some dirs are copied over
> and over for every test case. :-)
A wild idea, but since there's NIO everywhere now you could use an
in-memory filesystem for tests and avoid going to disk entirely :D
https://github.com/google/jimfs
Dawid
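A tiny sketch of the jimfs idea (for tests in general; path and file content are illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;
import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;

// An in-memory NIO filesystem: no disk I/O at all.
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
Path dir = fs.getPath("/scratch");
Files.createDirectories(dir);
Files.write(dir.resolve("test.txt"), "hello".getBytes(StandardCharsets.UTF_8));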
Thanks for contributing time to the release, Anshum.
Dawid
On Fri, Feb 20, 2015 at 10:16 PM, Anshum Gupta wrote:
> Sure, I'll fix that on the wiki. Thanks for pointing that out Uwe.
>
> On Fri, Feb 20, 2015 at 1:10 PM, Uwe Schindler wrote:
>
>> Many thanks! :-) Nice work!
>>
>> I found a small
Yes, BytesRef can be surprising. No, it probably won't change in
Lucene to comply with superb design principles. Yes, the odd design is
there for performance reasons and it does provide noticeable gain.
Perhaps you could file a JIRA issue to improve the documentation, this
would be helpful. For wh
> BytesRef is not different, because it is just a "reference" to pass around.
> And cloning a reference for sure should not clone the target of the
> reference. You are "cloning" the reference and only that (as the name of the
> class says: Bytes*Ref*)!
Exactly. It is a reference and as such, c
> Otherwise, it violates the Liskov substitution principle as well.
Sadly it also violates the Heisenberg's principle at the bit state
energy levels. We're working on improving that.
From your heated comments I think you should switch the language to
something that guarantees immutability of any
It is (b).
D.
On Fri, Aug 7, 2015 at 3:05 AM, Trejkaz wrote:
> I have recently done updates from Lucene 3.6 to 4.x and 4.x to 5.2.
>
> During this process, I noticed that the FST used by the Japanese
> analyser (AKA Kuromoji) was changing between releases. As I fear
> breakages in backwards comp
I think the issue here is what happens if an "uppercase" codepoint requires
a surrogate pair and the lowercase counterpart does not -- then the index
variable would indeed be screwed.
Dawid
On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler wrote:
> Hi,
>
> > Setting aside the fact that Character.
    int c3 = Character.charCount(Character.toLowerCase(cp));
    if (c1 != c2 || c1 != c3) {
      System.out.println(String.format(Locale.ROOT, "%d %d %d", c1, c2, c3));
    }
  }
D.
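The code fragment above is cut off; a self-contained reconstruction of what it appears to check (whether the UTF-16 length of a code point changes under case mapping) could look like this:

import java.util.Locale;

for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
  int c1 = Character.charCount(cp);
  int c2 = Character.charCount(Character.toUpperCase(cp));
  int c3 = Character.charCount(Character.toLowerCase(cp));
  if (c1 != c2 || c1 != c3) {
    System.out.println(String.format(Locale.ROOT, "%d %d %d", c1, c2, c3));
  }
}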
On Thu, Oct 22, 2015 at 10:15 AM, Dawid Weiss wrote:
> LowerCaseFilter will not handle that. So whereas it is "safe" for
> English hard-coded strings, it isn't safe for all fields you might
> index in general.
This filter is a "safe" fallback that works identically regardless of
the locale you
have on your computer (or on the server). This, I believ
You can addIndexes(Directory... dirs) -- then you don't have to deal
with CodecReader?
Dawid
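A minimal sketch of that suggestion; the target/source directories and the analyzer are assumed to exist already:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

try (IndexWriter writer = new IndexWriter(targetDir, new IndexWriterConfig(analyzer))) {
  // addIndexes(Directory...) copies segments over without requiring CodecReaders.
  writer.addIndexes(sourceDir1, sourceDir2);
}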
On Tue, Jan 12, 2016 at 4:43 PM, Manner Róbert wrote:
> Hi,
>
> we have used lucene 4.7.0 before, we are on the way to upgrade to 5.4.0.
>
> The problem I have is that writer.addIndexes now needs CodecRe
The GC change is after this:
BJ (2015-12-02): Upgrade to beast2 (72 cores, 256 GB RAM)
which leads me to believe these results are not comparable (different
machines, architectures, disks, CPUs perhaps?).
Dawid
On Thu, Apr 14, 2016 at 7:13 PM, Otis Gospodnetić
wrote:
> Hi,
>
> I was looking a
You could try to implement this refactoring, which would combine
linear storage of values (without the need to save the length of each
key explicitly) with their incremental addition order.
https://issues.apache.org/jira/browse/LUCENE-5854
The outcome may or may not be faster in practice (due to
> Point taken, but I wonder if there's an algorithmic shortcut to determinize
> the union of Levenshtein DFAs...
Levenshtein DFA is an automaton like any other; when you merge two
such automata, they will very likely contain states that need to be
merged (and their transition split) in order to be
> I think I see this now, and how skipping determinization and matching with
> the NFA could easily leave you with an intractable amount of backtracking
> for even the simpler binary question of does my input match any of the
> automatons I've unioned.
Note that with NFAs you may answer the questi
There are multiple Highlighter implementations for this purpose. Check
them out -- I'm sure one of them will suit your needs. In fact,
there's a new highlighter implemented very recently! Check out this
JIRA issue:
https://issues.apache.org/jira/browse/LUCENE-7438
Dawid
On Fri, Sep 30, 2016 at 8
Which Lucene version and which query parser is this? Can you provide a
test case/ code sample?
I just tried with StandardQueryParser and for:
sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
dump(sqp.parse("foo AND bar OR baz", "field_a"));
sqp.setDefaultOpe
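A self-contained version of that experiment (the field name and queries are illustrative; parse() throws QueryNodeException, hence the throws clause):

import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler;

static void dumpBoth() throws Exception {
  StandardQueryParser sqp = new StandardQueryParser();

  sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
  System.out.println(sqp.parse("foo AND bar OR baz", "field_a"));

  sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.OR);
  System.out.println(sqp.parse("foo AND bar OR baz", "field_a"));
}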
curl -s 'localhost:9200/test/_search?pretty' -d '{ "query": {
> "query_string": { "query": "foo AND bar OR baz" , "default_operator": "or"
> } } , "profile" : true}' | grep luce
> "lucene&qu
> But it is fairly trivially to tweak/extend the query parser to produce
> diff behavior.
I think the conclusion for the original poster should be that there's
really not enough information to provide a definite answer. Lucene is
a search engine. Much like with a mechanical engine, its final
appli
You could try using morfologik's byte-based implementation:
https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
I can't guarantee it'll be fast enough -- you need to sort those input
sequences and even thi
tate registry
> more ram efficient too ... I think it's essentially the same thing as
> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> automata.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 15, 2017 a
> PatriciaTrie. In particular building an FST with doShareSuffix = false is
> the fastest of any option,
If you don't share the suffix then you are building a kind of Patricia
trie... But suffix sharing is cheap and can give you a memory saving
(and resulting cache locality sometimes) that is non-
Or you could encode those term/ngram frequencies in one FST and then
reuse it. This would be memory-saving and fairly fast (~comparable to
a hash table).
Dawid
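A rough sketch of that idea (term -> frequency in a single FST), using the classic Builder API (renamed FSTCompiler in later Lucene versions). Terms and counts are made up; inputs must be added in sorted order, and IOException handling is omitted:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRefBuilder scratch = new IntsRefBuilder();
builder.add(Util.toIntsRef(new BytesRef("bar"), scratch), 7L);
builder.add(Util.toIntsRef(new BytesRef("foo"), scratch), 42L);
FST<Long> fst = builder.finish();

Long freq = Util.get(fst, new BytesRef("foo")); // 42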
On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless
wrote:
> Yes, this is a reasonable way to use Lucene (to see terms statistics across
> I'd like to read something written by whoever designed these classes: what
> motivated them, usage examples, what it is good for and what it is not good for.
> Maybe a history of the development of Automata in Lucene
Are you looking for a historical book on Lucene development or are you
looking to solve
> One small correction: we moved away from objects to more compact int[] a
> while ago for our automata implementation.
Right, forgot about that. There are still some trappy object-heavy
utilities like this one:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/luc
> Dawid, the thing is that I am not even sure that Automata are the perfect
> fit for my project and I thought some literature on it would help me decide
> whether to use it or not.
Still looks to me like you're approaching the problem from the wrong
side or don't
want to share the core problem wh
Thanks for your explanation, David.
I actually found working with all Lucene highlighters pretty
difficult. I have a few requirements which seemed deceptively simple:
1) highlight query hit regions (phrase, fuzzy, terms);
2) try to organise the resulting snippets to visually "center" the hit
regi
> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
> The UH can't currently do this; but with the OH (original Highlighter) you
> can but it appears somewhat awkward. See SimpleSpanFragmenter. I had said
> it was easy but I was mistaken; I'm getting rustier on the OH.
https://issues.apache.org/jira/browse/SOLR-1105
Yes, this is spot-on what I need with regard to copyTo fields, thanks
for the link!
> Or are the overlaps coming from passage offset ranges from separate queries
> to the same content?
The overlaps are caused by the fact that we have multiple sour
> it will be good if Lucene team can share their plans for a full java 9
> support (e.g. named modules of Lucene libraries)
So, here it is: we plan to support it. (*)
Dawid
(*) When it's stabilized and documented (it still isn't) [1]. And when
somebody has the time to do it (patches welcome, it'
Hi Mike. Search lucene dev archives. I did write a decompounder with Daniel
Naber. The quality was not ideal but perhaps better than nothing. Also,
Daniel works on languagetool.org? They should have something in there.
Dawid
On Sep 16, 2017 1:58 AM, "Michael McCandless"
wrote:
> Hello,
>
> I ne
> Hi, is it possible to create an Automaton in Lucene by parsing not a string
> but a byte array?
Can you state what problem you are trying to solve? This seems to be a
question stripped of a more general context -- why do you need those
byte-based automata?
Dawid
--
for example,
> be useful in bioinformatics or all those cases where data is not a basic
> ADT.
>
> Cristian
>
> 2017-09-30 12:24 GMT+02:00 Dawid Weiss :
>
>> > Hi, is it possible to create an Automaton in Lucene by parsing not a string
>> > but a byte array?
How about the quickest solution: dump the content of both indexes to a
document-per-line text
file, sort, diff?
Even if your indexes are large, if you have large spare disk, this
will be super fast.
Dawid
On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra
wrote:
> Hi,
>
> We use Lucene for indexin
basis using the Lucene API?
> Chetan Mehrotra
>
>
> On Tue, Jan 2, 2018 at 1:03 PM, Dawid Weiss wrote:
>> How about the quickest solution: dump the content of both indexes to a
>> document-per-line text
>> file, sort, diff?
>>
>> Even if your indexes are large,
t. Actual indexed content would be same if both index have
> "status" field indexed so we only need to validate fieldnames per
> document. Something like
>
> Thanks for reading all this if you have read so far :)
>
> Chetan Mehrotra
> [1]
> https://github.com/apach
> That helps and explains why there is no support in std api
This isn't an API problem. This is by design -- this is how it works.
If you wish
to retrieve fields that are indexed and stored with the document, the
API provides
such an option (indexed and stored field type). Your indexed fields
are
Erick already pointed you at the "cleanup" rule. This is fairly
generic, but if you know
the properties being modified you should still clean them up in @After or
@AfterClass -- this is useful for other people to know that you're modifying
them, if for nothing else.
Randomized testing package has
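A minimal sketch of doing that cleanup by hand in plain JUnit (the property name is made up); the test-framework rules can handle this automatically, but the explicit form makes the intent visible:

import org.junit.AfterClass;
import org.junit.BeforeClass;

private static String saved;

@BeforeClass
public static void rememberProperty() {
  saved = System.getProperty("my.test.property");
}

@AfterClass
public static void restoreProperty() {
  if (saved == null) {
    System.clearProperty("my.test.property");
  } else {
    System.setProperty("my.test.property", saved);
  }
}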
Use MMapDirectory on a temporary location, Matthias. If you really
need in-memory indexes, a new Directory implementation is coming
(RAMDirectory will be deprecated, then removed), but the difference
compared to MMapDirectory is typically not worth the hassle. See this
issue for more discussion.
h
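A minimal sketch of that advice, assuming a throwaway on-disk index is all that's needed:

import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

Path tmp = Files.createTempDirectory("lucene-tmp-index");
try (Directory dir = new MMapDirectory(tmp);
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
  // index documents, search, etc.
}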
We are faced with a similar situation. Yes, the merge process can take
a long time and is mostly single-threaded (if you're merging from N
segments into a single segment, only one thread does the job). As
Erick pointed out, the merge process takes a backseat compared to
indexing and searches (in mo
> int processors = Runtime.getRuntime().availableProcessors();
> ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
> cms.setMaxMergesAndThreads(processors, processors);
See, the number of threads in the CMS only matters if you have
concurrent merges of independent segments. What you
Thanks for chipping in, Toke. A ~1TB index is impressive.
Back of the envelope says reading & writing 900GB in 8 hours is
2*900GB/(8*60*60s) = 64MB/s. I don't remember the interface for our
SSD machine, but even with SATA II this is only ~1/5th of the possible
fairly sequential IO throughput. So f
/jira/browse/LUCENE-8580
Dawid
On Fri, Nov 2, 2018 at 10:17 PM Dawid Weiss wrote:
>
> Thanks for chipping in, Toke. A ~1TB index is impressive.
>
> Back of the envelope says reading & writing 900GB in 8 hours is
> 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interfa
bq. We switched to ByteBuffersDirectory with 7.5, but
I actually didn't see much performance improvements or savings in memory.
Once the indexes are built I don't think there will be much of a
difference. The core problem with RAMDirectory was related to
synchronizations during merges/ file manipu
> It's entirely possible it fails to dig into Maps correctly with newer Java
> releases; maybe Dawid or Uwe would know?
We have removed all reflection from that class a while ago exactly
because of encapsulation issues introduced in newer Java versions.
https://github.com/apache/lucene-solr/blob/
ashMap by simply counting the size of the
> Node that is used for each entry, although given the dynamic nature of
> these data structures (HashMap eg can use TreeNodes sometimes
> depending on data distribution) it would be almost impossible to be
> 100% accurate.
> On Thu, Dec 6, 2018
k on it.
>
> Regards,
> Jerven
> On 11/30/18 12:01 PM, Dawid Weiss wrote:
> > Just FYI: I implemented a quick and dirty PoC to see what it'd work
> > like. Not much of a difference on my machine (since postings merging
> > dominates everything else). Interesting prob
> We have chosen G1GC for both Java 8 and Java 11 versions.
It's not like we have answers for everything. ;) If it's the same GC
on both and there is still a slowdown then something else may be
causing it -- hard to tell without doing trial-and-error. There is a
set of performance benchmarks; perh