Re: SIGSEGV when indexing documents.

2012-01-11 Thread Dawid Weiss
it of this (3*4GB * 2 = 24GB): > > http://www.kingston.com/datasheets/KHX1600C9S3K2_8GX.pdf > > On Wed, Jan 11, 2012 at 9:33 AM, Dawid Weiss wrote: > >> This is a fairly old VM you're running with, but if you get the same >> error with jrockit then I would assume it

Re: SIGSEGV when indexing documents.

2012-01-11 Thread Dawid Weiss
as the rest of your suggestions and post back the results. > > Thanks. > > > On Wed, Jan 11, 2012 at 9:56 AM, Dawid Weiss wrote: > >> The dump you're getting indicates a SIGSEGV during garbage collection. >> This isn't unlikely (there are bugs in there as well

Re: SIGSEGV when indexing documents.

2012-01-11 Thread Dawid Weiss
> > > On Wed, Jan 11, 2012 at 10:06 AM, Dawid Weiss wrote: >> >> Oops, yes, sorry -- I only quickly looked at the invocation line on >> stack overflow and overlooked it. -Xms4g shouldn't make any >> difference. >> >> Dawid >> >> On We

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use > BasicAutomata.makeString(String) on each of your expanded terms, then > BasicOperations.union on all of those automata to make a single How many input strings do you have? The API Mike mentioned is from a port of the Brics li

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
I filed an issue for that. https://issues.apache.org/jira/browse/LUCENE-3832 I'll try to port it myself actually. It shouldn't be a big problem. Dawid On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless wrote: > Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for > the insert/

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
The issue has a patch -- feel free to try it out. Dawid On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss wrote: > I filed an issue for that. > https://issues.apache.org/jira/browse/LUCENE-3832 > > I'll try to port it myself actually. It shouldn't be a big problem. > > Da

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
> Wow, that was quick!  Thanks! The power of open source and coffee break, combined... > I don't think we'll have too many terms per query term - as I said earlier, > we're restricting the expansions to those with an edit distance of 1.  But > this looks cool anyway. Shouldn't make much of a d

Re: RAM or SSD...

2012-07-18 Thread Dawid Weiss
> Rum is an essential ingredient in all software systems :-) You probably meant "social systems". D. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache

Re: RAM or SSD...

2012-07-18 Thread Dawid Weiss
> Why anyone buys computers without SSD's is a mystery to me. Use SSDs for On topic and highly recommended: http://www.youtube.com/watch?v=H7PJ1oeEyGg Dawid

Re: RAM or SSD...

2012-07-19 Thread Dawid Weiss
Read this: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Dawid On Thu, Jul 19, 2012 at 1:32 PM, Dragon Fly wrote: > > The slowest part of my application is to read the search hits from disk. I > was hoping that using an SSD or RAMDirectory/MMapDirectory would speed th

Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter

2012-08-01 Thread Dawid Weiss
http://static1.blip.pl/user_generated/update_pictures/1758685.jpg On Thu, Aug 2, 2012 at 8:32 AM, roz dev wrote: > wow!! That was quick. > > Thanks a ton. > > > On Wed, Aug 1, 2012 at 11:07 PM, Simon Willnauer > wrote: > >> On Thu, Aug 2, 2012 at 7:53 AM, roz dev wrote: >> > Thanks Robert for th

Re: Efficient string lookup using Lucene

2012-08-24 Thread Dawid Weiss
What you need is a suffix tree or a suffix array. Both data structures will allow you to perform constant-time searches for existence/ occurrence of any input pattern. Depending on how much text you have on the input it may either be a simple task -- see here: http://labs.carrotsearch.com/jsuffixa
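A minimal sketch of the suffix array idea in plain Java, for illustration only. It is not the jsuffixarrays library linked in the message: the class and method names are made up here, the construction is the naive "sort all suffixes" approach, and the lookup is a binary search over the sorted suffixes (so it costs a logarithmic number of string comparisons rather than constant time).

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Lucene or jsuffixarrays.
class SuffixArraySketch {

    /** Builds a suffix array by sorting suffix start offsets (naive construction). */
    public static Integer[] build(String text) {
        Integer[] sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = i;
        }
        // Order offsets by the lexicographic order of the suffixes they start.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        return sa;
    }

    /** Returns true if pattern occurs anywhere in text (binary search over suffixes). */
    public static boolean contains(String text, Integer[] sa, String pattern) {
        int lo = 0;
        int hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(pattern)) {
                return true; // some suffix begins with the pattern, so it occurs in text
            }
            if (suffix.compareTo(pattern) < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "banana";
        Integer[] sa = build(text);
        System.out.println(contains(text, sa, "nan"));
        System.out.println(contains(text, sa, "nab"));
    }
}
```

For large inputs a dedicated construction algorithm would be needed; the naive sort above is only meant to show why a sorted list of suffixes answers substring-existence queries.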

Re: Efficient string lookup using Lucene

2012-08-26 Thread Dawid Weiss
> Does Lucene support this type of structure, or do I need to somehow implement > it outside Lucene? You'd have to implement it separately but it'd be much, much smaller than Lucene itself (even obfuscated). > By the way, I need this to run on an Android phone so size of memory might be > an is

Re: Efficient string lookup using Lucene

2012-08-26 Thread Dawid Weiss
> The WhitespaceAnalyzer breaks up text by spaces and tabs and newlines. > After that, you can use wildcards. This will use very little space. I > believe leading & trailing wildcards are supported now, right? If leading wildcards take too much time (don't know, really) then one could also try to index

Re: WFST/Analyzing Suggesters: foreign keys, user-supplied filter, highlighting

2012-10-30 Thread Dawid Weiss
> https://issues.apache.org/jira/browse/LUCENE-4491 ? Could you simply > stuff your ISBN onto the end of the suggestion (ie enroll Lucene in > Action|1933988177)? Just remember that if your suffixes are unique then you'll be expanding the automaton quite a bit (unique suffix paths). D.

Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-12-01 Thread Dawid Weiss
Iterating character-by-character is different than considering the entire string at once so your observation is correct, that's how it's supposed to work. In particular, note this in String#toLowerCase documentation: "Since case mappings are not always 1:1 char mappings, the resulting String may b
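The difference described above can be seen with the JDK alone. The sketch below is an illustration under stated assumptions, not Lucene's LowerCaseFilter: `perChar` is a hypothetical helper that lower-cases one UTF-16 char at a time, and the sample input is U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE), whose whole-string lowercase form is two chars long.

```java
import java.util.Locale;

// Hypothetical illustration class; it does not use any Lucene code.
class LowerCaseSketch {

    /** Lower-cases one UTF-16 char at a time, so the output length always equals the input length. */
    public static String perChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
        // Whole-string lowercasing may apply a 1:N mapping.
        System.out.println(input.toLowerCase(Locale.ROOT).length());
        // Per-char lowercasing can only apply 1:1 mappings.
        System.out.println(perChar(input).length());
    }
}
```

This is why the two approaches can legitimately disagree on some inputs while agreeing on plain ASCII text.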

Re: Suggesters: circumfix suggestions

2013-01-16 Thread Dawid Weiss
> Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and > then have a separate search index with "boston red sox rumor" indexed > as a document. If the user types "red so", then you run suggest on > "red" and on "so", and then run a hmm MultiPhraseQuery for > (red|redmond|reddit)

Re: Japanese analyzer

2013-01-18 Thread Dawid Weiss
Jerome, Some of the tokens are removed because their part of speech tags are in the stoptags file? That's my guess at least -- you can always try to copy/paste Japanese analyzer and change the token stream components: protected TokenStreamComponents createComponents(String fieldName, Reader rea

Re: Lucene vs Glimpse

2013-02-05 Thread Dawid Weiss
Here's another thought: if you desperately need complex searches then you could do a heuristic filtering to narrow down the search: use an analyzer that does some form of input splitting into terms (removing excess whitespace or even producing n-grams from the input), then do the same for the query

Re: problem with the lucene and tomcat server

2011-02-16 Thread Dawid Weiss
Start Tomcat with class loading info and inspect the logs for multiple Lucene JARs (even though the version seems to be fine)? export CATALINA_OPTS=-XX:+TraceClassLoading $TOMCAT_HOME/bin/catalina run Dawid On Wed, Feb 16, 2011 at 10:23 AM, starz10de wrote: > > Hi All, > > I have an application

Re: Clustering with Lucene?

2011-04-26 Thread Dawid Weiss
Can you shed some more light on what you're trying to achieve (what is the purpose of clustering -- are clusters to be utilized for front-end user interface, further data mining analysis, etc.)? With the sizes you report Carrot2 won't work for you, I'm afraid, but Mahout may. Still, there's plenty

Re: Clustering with Lucene?

2011-04-26 Thread Dawid Weiss
> 1) We index around 20 fields, of that we want to have grouping option > for five of them. For ex., user can search on name of the city and we > should have option to group by products available in that city (and > vice-versa). > Are these fields strictly defined or free text? Because if they are

Re: Clustering with Lucene?

2011-04-26 Thread Dawid Weiss
> that IP etc. These are definitely not dictionary fields. > > I'm looking at faceting right now - checking if this would work with > Lucene (as we can not change to Solr at this point). What's the main > difference between clustering and faceting? > > Thanks, > -vivek

Re: Immutable OpenBitSet?

2011-04-27 Thread Dawid Weiss
There are solutions to solve the initialization problem. The JVM guarantees > that an object is consistent after the ctor is run, so you can do the > initialization like this (please note the double {{}}, which is an inline > ctor, this is also often seen for unmodifiable HashSets): > > final OpenB

Re: Immutable OpenBitSet?

2011-04-28 Thread Dawid Weiss
> In general a *newly* created object that was not yet seen by any other > thread is always safe. This is why I said, set all bits in the ctor. This > is > easy to understand: Before the ctor returns, the object's contents and all > references like arrays are not seen by any other thread (that's >

Re: Immutable OpenBitSet?

2011-04-28 Thread Dawid Weiss
> static void writer() { >f = new FinalFieldExample(); > } > static void reader() { >if (f != null) { > int i = f.x; // guaranteed to see 3 > int j = f.y; // could see 0 >} > } >} > In this snippet of code there's not even a guarant

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Dawid Weiss
Don't know if this helps, but debugging stuff like this I simply add a (manually inserted or aspectj-injected) recursion count, add a breakpoint inside an if checking for recursion count >> X and run the vm with an attached socket debugger. This lets you run at (nearly) full speed and once you hit
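A minimal sketch of the manually inserted recursion counter described above. Everything here is hypothetical: `sum` stands in for the recursive method under investigation, and `LIMIT` plays the role of "X" in the message. The idea is that the debugger breakpoint goes on the line inside the `if`, so the program runs at close to full speed until the depth is exceeded.

```java
// Hypothetical illustration class; the recursive method is a stand-in.
class RecursionGuardSketch {
    static final int LIMIT = 1_000; // assumed threshold, tune to the real code
    static int depth = 0;           // current recursion depth
    static int maxDepth = 0;        // deepest level seen so far

    public static int sum(int n) {
        depth++;
        try {
            if (depth > maxDepth) {
                maxDepth = depth;
            }
            if (depth > LIMIT) {
                // Set the debugger breakpoint on the next line.
                System.err.println("recursion depth exceeded " + LIMIT);
            }
            return n == 0 ? 0 : n + sum(n - 1);
        } finally {
            depth--; // always unwind the counter, even if an exception propagates
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(100));
        System.out.println(maxDepth);
    }
}
```

The counter is a plain static field, so this form only suits single-threaded debugging; a thread-local would be needed otherwise.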

Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Dawid Weiss
> lucene/Search that is taking the time, I also had another attempt using > luke > > but find it incredibly buggy and of little use > Can you expand on this too? What kind of "incredible bugs" did you see? Without feedback there is little progress, so bug reports count. Dawid

Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Dawid Weiss
:17, Dawid Weiss wrote: > > > > > lucene/Search that is taking the time, I also had another attempt using >> luke >> > but find it incredibly buggy and of little use >> > > Can you expand on this too? What kind of "incredible bugs" did you see? >

Lucene 3.0.3 with debug information

2011-04-29 Thread Dawid Weiss
This is the e-mail you're looking for, Steven (it wasn't forwarded to the list, apparently). Dawid -- Forwarded message -- From: Paul Taylor Date: Fri, Apr 29, 2011 at 10:11 PM Subject: Re: Lucene 3.0.3 with debug information To: Dawid Weiss On 29/04/2011 15:17, D

Re: Using Solr's (Auto)suggest with plain lucene

2011-05-05 Thread Dawid Weiss
If you check out the source code of solr/lucene, look at FSTLookup class and FSTLookupTest -- you can populate FSTLookup manually with terms/ phrases from your index and then use the resulting automaton for suggestions. Dawid On Thu, May 5, 2011 at 2:54 PM, Clemens Wyss wrote: > I have implemen

Re: Using Solr's (Auto)suggest with plain lucene

2011-05-06 Thread Dawid Weiss
p "sits" behind this suggester > http://search-lucene.com/m/586gA4ccL11 > here? > > > > -Original Message- > > From: Dawid Weiss [mailto:dawid.we...@gmail.com] > > Sent: Thursday, May 5, 2011 15:00 > > To: java-user@lucene.apache.org

Re: Using Solr's (Auto)suggest with plain lucene

2011-05-06 Thread Dawid Weiss
uggested terms > untouched, i.e. cased. > > Clemens > > > -Original Message- > > From: Dawid Weiss [mailto:dawid.we...@gmail.com] > > Sent: Friday, May 6, 2011 11:12 > > To: java-user@lucene.apache.org > > Subject: Re: Using Solr's

Re: Using Solr's (Auto)suggest with plain lucene

2011-05-06 Thread Dawid Weiss
to add "fuzzy" lookup for terms? > > E.g.: > "melo" should also bring up "merlot" > > > -Original Message- > > From: Dawid Weiss [mailto:dawid.we...@gmail.com] > > Sent: Friday, May 6, 2011 11:30 > > To: java-user@

German compound decomposition (native speakers: help needed).

2011-06-14 Thread Dawid Weiss
First of all I should probably congratulate my fellow Germans -- Dirk Nowitzki's outstanding performance during this year's NBA finals will become part of the history of basketball. As a Pole, I admit I'm really freaking jealous. Now... back to the subject. A number of people have expressed an in

Re: Lucene sort performance roots?

2011-06-23 Thread Dawid Weiss
Can you describe the kind of sorting you're doing? Maybe the data is already sorted (and in RAM) and you're only getting it out? Dawid On Fri, Jun 24, 2011 at 3:32 AM, Denis Bazhenov wrote: > Well, maybe it's a bit controversial question, but anyway... > > Lucene is a great toolkit for search ap

Re: Lucene sort performance roots?

2011-06-24 Thread Dawid Weiss
ting by field value. We have around 1M documents > which we are searching and returns them to the user in reverse order by > creation date. Creation date is indexed in separated field in lucene of > course. > > On Jun 24, 2011, at 4:52 PM, Dawid Weiss wrote: > >> Can yo

Re: Autocompletion on large index

2011-07-07 Thread Dawid Weiss
Elmer, a TST will have a large overhead. An FST may not be much better if your input has very few shared prefixes or suffixes. In your case I think this is unfortunately true. What I would do is create a regular Lucene index and store it on disk, then run prefix queries on it. Should work and scale to l

Re: Autocompletion on large index

2011-07-07 Thread Dawid Weiss
tion today but maybe we > should? Suffix sharing requires sizable RAM while building because it > maintains a hash containing all nodes in order to locate the dups. > > It's also possible to improve FST to have shades of gray between > on/off... I'll open an issue. > >

Re: Autocompletion on large index

2011-07-07 Thread Dawid Weiss
while ago, but I've been swamped with other work, sorry. Dawid On Thu, Jul 7, 2011 at 7:16 PM, Michael McCandless wrote: > On Thu, Jul 7, 2011 at 7:00 AM, Dawid Weiss wrote: >> Another option to trade off size and memory is to use an LRU-like cache of suffix >> nodes/ registry. I'm s

Re: SSD Experience

2011-08-23 Thread Dawid Weiss
This one is humorous (watch for foul language though). It does get to the point, however, and Bergman is a clever guy: http://www.livestream.com/oreillyconfs/video?clipId=pla_3beec3a2-54f5-4a19-8aaf-35a839b6ecaa Dawid On Tue, Aug 23, 2011 at 10:00 AM, Toke Eskildsen wrote: > On Mon, 2011-08-22

Re: SSD Experience

2011-08-23 Thread Dawid Weiss
> > > We installed SSDs in all developer machines in 2009 (Intel X25) and > haven't looked back. > > I can confirm this from my own experience. Once you have a (fast) SSD on your development machine you are not likely to go back to a spinning drive... Dawid

Re: Lucene killing JVM

2011-09-01 Thread Dawid Weiss
Also, run memtest on your machine to rule out memory corruption; this unfortunately may cause effects like the one you're describing. Dawid On Thu, Sep 1, 2011 at 11:21 AM, Federico Fissore wrote: > Dragan Jotanovic, il 01/09/2011 11:12, ha scritto: >> >> Hi, >> I recently upgraded to lucene 3.3

Re: Bet you didn't know Lucene can...

2011-10-23 Thread Dawid Weiss
Hi Grant, In Carrot2 (and Carrot Search's commercial products) we're not using Lucene as an indexing/ search service directly, but we are re-using a lot of internal infrastructure (like analyzers, ported snowball stemmers and other segmentation stuff). We also plan on using the new language identi

Re: AlreadySetException ?

2011-10-24 Thread Dawid Weiss
> What can possibly cause this exception? I can't be calling the constructor of > IndexWriter twice, can I ;) I bet Chuck Norris can do that! :) Dawid

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Dawid Weiss
Avg lookup time slightly less than a HashSet? Interesting. Is the code to these benchmarks available somewhere? Dawid On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote: > > On Oct 25, 2011, at 11:26 AM, mark harwood wrote: > using Lucene that don't fit under the core premise of full te

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Dawid Weiss
> Lucene started out at an avg 3ms but subsequent runs took it down > dramatically due to OS file caching. The all-in-memory hashset implementation > clearly did not demonstrate the same speed ups between runs. I don't say the benchmark was wrong or anything, but this is surprising. I mean, the

Re: Bet you didn't know Lucene can...

2011-10-26 Thread Dawid Weiss
m also using public domain Wikipedia data so can release the code and data > somewhere if that's of interest. > > Cheers > Mark > > > > - Original Message - > From: Dawid Weiss > To: java-user@lucene.apache.org > Cc: > Sent: Tuesday, 25 October 2011,

Re: Suggest with FST

2011-11-16 Thread Dawid Weiss
I am currently working on a refactoring of FSTLookup so that either one or both of your objectives will be possible. I would still argue that storing exact scores does not make much sense (think: if you collect query logs then you probably won't differentiate between two suggestions that differ by

Re: JLemmaGen project

2013-11-04 Thread Dawid Weiss
Hi Michal, Pretty cool. Your work reminds me of what Leo Galambos did a while back: http://link.springer.com/chapter/10.1007/978-3-540-39985-8_22 I believe his implementation is still available in the Egothor search engine project. Dawid On Wed, Oct 23, 2013 at 5:17 PM, Michal Hlavac wrote:

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Dawid Weiss
Hi Steve, I have to admit I also find it frequently useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's a useful to-have for other NLP tasks). Do you think it'd be possible (read: relatively easy) to create an analyzer (or a modificatio

Re: BTRFS ?

2014-12-21 Thread Dawid Weiss
> I spotted Uwe's comment in JIRA the other day "BTRFS, which might also > bring some cool things for Lucene.". What cool things about BTRFS are you talking about, Uwe? Just curious. Dawid

Re: BTRFS ?

2014-12-22 Thread Dawid Weiss
.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Dawid Weiss [mailto:dawid.we...@gmail.com] >> Sent: Monday, December 22, 2014 8:48 AM >> To: java-user@lucene.apache.org >> Cc: Uwe Schindler >> Subject: Re: BTRFS ? >

Re: BTRFS ?

2014-12-23 Thread Dawid Weiss
> This could speed up tests, especially Solr where some dirs are copied over > and over for every test case. :-) A wild idea, but since there's NIO everywhere now you could use an in-memory filesystem for tests and avoid going to disk entirely :D https://github.com/google/jimfs Dawid -

Re: [ANNOUNCE] Apache Lucene 5.0.0 released

2015-02-20 Thread Dawid Weiss
Thanks for contributing time to the release, Anshum. Dawid On Fri, Feb 20, 2015 at 10:16 PM, Anshum Gupta wrote: > Sure, I'll fix that on the wiki. Thanks for pointing that out Uwe. > > On Fri, Feb 20, 2015 at 1:10 PM, Uwe Schindler wrote: > >> Many thanks! :-) Nice work! >> >> I found a small

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Dawid Weiss
Yes, BytesRef can be surprising. No, it probably won't change in Lucene to comply with superb design principles. Yes, the odd design is there for performance reasons and it does provide noticeable gain. Perhaps you could file a JIRA issue to improve the documentation, this would be helpful. For wh

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Dawid Weiss
> BytesRef is not different, because it is just a "reference" to pass around. > And cloning a reference for sure should not clone the target of the > reference. You are "cloning" the reference and only that (as the name of the > class says: Bytes*Ref*)! Exactly. It is a reference and as such, c

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Dawid Weiss
> Otherwise, it violates the Liskov substitution principle as well. Sadly it also violates the Heisenberg's principle at the bit state energy levels. We're working on improving that. From your heated comments I think you should switch the language to something that guarantees immutability of any

Re: Why do the Japanese analyser FST files change every release?

2015-08-06 Thread Dawid Weiss
It is (b). D. On Fri, Aug 7, 2015 at 3:05 AM, Trejkaz wrote: > I have recently done updates from Lucene 3.6 to 4.x and 4.x to 5.2. > > During this process, I noticed that the FST used by the Japanese > analyser (AKA Kuromoji) was changing between releases. As I fear > breakages in backwards comp

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Dawid Weiss
I think the issue here is what happens if an "uppercase" codepoint requires a surrogate pair and the lowercase counterpart does not -- then the index variable would indeed be screwed. Dawid On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler wrote: > Hi, > > > Setting aside the fact that Character.

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Dawid Weiss
unt(Character.toLowerCase(cp)); if (c1 != c2 || c1 != c3) { System.out.println(String.format(Locale.ROOT, "%d %d %d", c1, c2, c3)); } } D. On Thu, Oct 22, 2015 at 10:15 AM, Dawid Weiss wrote:
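The snippet above is cut off, so the following is only a guess at the kind of check it performs, written as a self-contained sketch: for every Unicode code point, compare its UTF-16 length with the lengths of its upper-case and lower-case mappings and report any mismatch. The class name, the `U+%04X` prefix in the output and the assumption that the middle value comes from `Character.toUpperCase` are illustrative choices, not taken from the original message.

```java
import java.util.Locale;

// Hypothetical reconstruction for illustration; not the code from the thread.
class CaseCharCountSketch {

    /** Counts code points whose UTF-16 length differs from that of their case mappings. */
    public static int countMismatches() {
        int mismatches = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int c1 = Character.charCount(cp);
            int c2 = Character.charCount(Character.toUpperCase(cp));
            int c3 = Character.charCount(Character.toLowerCase(cp));
            if (c1 != c2 || c1 != c3) {
                mismatches++;
                System.out.println(String.format(Locale.ROOT, "U+%04X %d %d %d", cp, c1, c2, c3));
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

The result depends on the Unicode data shipped with the JDK in use, so no expected output is given here.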

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Dawid Weiss
> LowerCaseFilter will not handle that. So whereas it is "safe" for > English hard-coded strings, it isn't safe for all fields you might > index in general. This filter is a "safe" fallback that works identically regardless of the locale you have on your computer (or on the server). This, I believ

Re: IndexWriter.addIndexes with LeafReader parameter

2016-01-12 Thread Dawid Weiss
You can addIndexes(Directory... dirs) -- then you don't have to deal with CodecReader? Dawid On Tue, Jan 12, 2016 at 4:43 PM, Manner Róbert wrote: > Hi, > > we have used lucene 4.7.0 before, we are on the way to upgrade to 5.4.0. > > The problem I have is that writer.addIndexes now needs CodecRe

Re: Lucene indexing throughput (and Mike's lucenebench charts)

2016-04-14 Thread Dawid Weiss
The GC change is after this: BJ (2015-12-02): Upgrade to beast2 (72 cores, 256 GB RAM) which leads me to believe these results are not comparable (different machines, architectures, disks, CPUs perhaps?). Dawid On Thu, Apr 14, 2016 at 7:13 PM, Otis Gospodnetić wrote: > Hi, > > I was looking a

Re: Re: Why Two Levels of Indirection in BytesRefHash class ?

2016-05-09 Thread Dawid Weiss
You could try to implement this refactoring, which would combine linear storage of values (without the need to save the length of each key explicitly) with their incremental addition order. https://issues.apache.org/jira/browse/LUCENE-5854 The outcome may or may not be faster in practice (due to

Re: Levenshtein FST's?

2016-05-28 Thread Dawid Weiss
> Point taken, but I wonder if there's an algorithmic shortcut to determinize > the union of Levenshtein DFAs... Levenshtein DFA is an automaton like any other; when you merge two such automata, they will very likely contain states that need to be merged (and their transition split) in order to be

Re: Levenshtein FST's?

2016-05-29 Thread Dawid Weiss
> I think I see this now, and how skipping determinization and matching with > the NFA could easily leave you with an intractable amount of backtracking > for even the simpler binary question of does my input match any of the > automatons I've unioned. Note that with NFAs you may answer the questi

Re: How can I get the term positions from a query?

2016-09-29 Thread Dawid Weiss
There are multiple Highlighter implementations for this purpose. Check them out -- I'm sure one of them will suit your needs. In fact, there's a new highlighter implemented very recently! Check out this JIRA issue: https://issues.apache.org/jira/browse/LUCENE-7438 Dawid On Fri, Sep 30, 2016 at 8

Re: Query parser and default operator

2016-11-09 Thread Dawid Weiss
Which Lucene version and which query parser is this? Can you provide a test case/ code sample? I just tried with StandardQueryParser and for: sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.AND); dump(sqp.parse("foo AND bar OR baz", "field_a")); sqp.setDefaultOpe

Re: Query parser and default operator

2016-11-10 Thread Dawid Weiss
curl -s 'localhost:9200/test/_search?pretty' -d '{ "query": { > "query_string": { "query": "foo AND bar OR baz" , "default_operator": "or" > } } , "profile" : true}' | grep luce > "lucene"

Re: question

2017-01-20 Thread Dawid Weiss
> But it is fairly trivially to tweak/extend the query parser to produce > diff behavior. I think the conclusion for the original poster should be that there's really not enough information to provide a definite answer. Lucene is a search engine. Much like with a mechanical engine, its final appli

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Dawid Weiss
You could try using morfologik's byte-based implementation: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java I can't guarantee it'll be fast enough -- you need to sort those input sequences and even thi

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Dawid Weiss
tate registry > more ram efficient too ... I think it's essentially the same thing as > the FST.Builder's NodeHash, just minus the outputs that FSTs have vs > automata. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Feb 15, 2017 a

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-20 Thread Dawid Weiss
> PatriciaTrie. In particular building an FST with doShareSuffix = false is > the fastest of any option, If you don't share the suffix then you are building a kind of Patricia trie... But suffix sharing is cheap and can give you a memory saving (and resulting cache locality sometimes) that is non-

Re: codec: accessing term dictionary

2017-03-10 Thread Dawid Weiss
Or you could encode those term/ ngram frequencies one FST and then reuse it. This would be memory-saving and fairly fast (~comparable to a hash table). Dawid On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless wrote: > Yes, this is a reasonable way to use Lucene (to see terms statistics across

Re: Automata and Transducer on Lucene 6

2017-04-18 Thread Dawid Weiss
> I'd like to read something written by who designed these classes. What > motivated, usage examples, what it is good for and what it is not good for. > Maybe a history of the development of Automata on Lucene Are you looking for a historical book on Lucene development or are you looking to solve

Re: Automata and Transducer on Lucene 6

2017-04-18 Thread Dawid Weiss
> One small correction: we moved away from objects to more compact int[] a > while ago for our automata implementation. Right, forgot about that. There are still some trappy object-heavy utilities like this one: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/luc

Re: Automata and Transducer on Lucene 6

2017-04-19 Thread Dawid Weiss
> Dawid, the thing is that I am not even sure that Automata are the perfect > fit for my project and I thought some literature on it would help me decide > whether to use it or not. Still looks to me like you're approaching the problem from the wrong side or don't want to share the core problem wh

Re: Highlighting and delineating Passages (fragmenting)

2017-05-27 Thread Dawid Weiss
Thanks for your explanation, David. I actually found working with all Lucene highlighters pretty difficult. I have a few requirements which seemed deceptively simple: 1) highlight query hit regions (phrase, fuzzy, terms); 2) try to organise the resulting snippets to visually "center" the hit regi

Re: Highlighting and delineating Passages (fragmenting)

2017-05-30 Thread Dawid Weiss
> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3. > The UH can't currently do this; but with the OH (original Highlighter) you > can but it appears somewhat awkward. See SimpleSpanFragmenter. I had said > it was easy but I was mistaken; I'm getting rustier on the OH.

Re: Highlighting and delineating Passages (fragmenting)

2017-05-30 Thread Dawid Weiss
https://issues.apache.org/jira/browse/SOLR-1105 Yes, this is spot-on what I need with regard to copyTo fields, thanks for the link! > Or are the overlaps coming from passage offset ranges from separate queries > to the same content? The overlaps are caused by the fact that we have multiple sour

Re: Java 9 issues

2017-07-28 Thread Dawid Weiss
> it will be good if Lucene team can share their plans for a full java 9 > support (e.g. named modules of Lucene libraries) So, here it is: we plan to support it. (*) Dawid (*) When it's stabilized and documented (it still isn't) [1]. And when somebody has the time to do it (patches welcome, it'

Re: German decompounding/tokenization with Lucene?

2017-09-16 Thread Dawid Weiss
Hi Mike. Search lucene dev archives. I did write a decompounder with Daniel Naber. The quality was not ideal but perhaps better than nothing. Also, Daniel works on languagetool.org? They should have something in there. Dawid On Sep 16, 2017 1:58 AM, "Michael McCandless" wrote: > Hello, > > I ne

Re: Binary Automaton

2017-09-30 Thread Dawid Weiss
> Hi, is it possible to create an Automaton in Lucene by parsing not a string > but a byte array? Can you state what problem you are trying to solve? This seems to be a question stripped of a more general context -- why do you need those byte-based automata? Dawid

Re: Binary Automaton

2017-09-30 Thread Dawid Weiss
for example, > be useful in bioinformatics or all those cases where data is not a basic > ADT. > > Cristian > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss : > >> > Hi , it is possible to create a Automaton in lucene parsing not a string >> > but a byte array?

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-01 Thread Dawid Weiss
How about the quickest solution: dump the content of both indexes to a document-per-line text file, sort, diff? Even if your indexes are large, if you have large spare disk, this will be super fast. Dawid On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra wrote: > Hi, > > We use Lucene for indexin
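The "dump, sort, diff" suggestion can be sketched in a few lines of plain Java. This is an in-memory illustration only: it assumes each document has already been dumped as one line of text (the `id=... status=...` format in the usage example is invented), whereas for large indexes the message suggests going through files on disk and standard sort/diff tools.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; the dump format is an assumption.
class IndexDumpDiffSketch {

    /** Returns lines present in only one dump, prefixed with "< " (left only) or "> " (right only). */
    public static List<String> diff(List<String> left, List<String> right) {
        List<String> a = new ArrayList<>(left);
        List<String> b = new ArrayList<>(right);
        Collections.sort(a);
        Collections.sort(b);
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        // Standard merge walk over the two sorted lists.
        while (i < a.size() || j < b.size()) {
            int cmp;
            if (i == a.size()) {
                cmp = 1;  // left exhausted, remaining lines are right-only
            } else if (j == b.size()) {
                cmp = -1; // right exhausted, remaining lines are left-only
            } else {
                cmp = a.get(i).compareTo(b.get(j));
            }
            if (cmp == 0) {
                i++;
                j++;
            } else if (cmp < 0) {
                out.add("< " + a.get(i++));
            } else {
                out.add("> " + b.get(j++));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("id=2 status=ok", "id=1 status=ok");
        List<String> second = Arrays.asList("id=1 status=ok", "id=3 status=new");
        System.out.println(diff(first, second));
    }
}
```

Sorting first makes the comparison independent of document order inside each index, which is the point of the suggestion.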

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Dawid Weiss
basis using the Lucene API? > Chetan Mehrotra > > > On Tue, Jan 2, 2018 at 1:03 PM, Dawid Weiss wrote: >> How about the quickest solution: dump the content of both indexes to a >> document-per-line text >> file, sort, diff? >> >> Even if your indexes are large,

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Dawid Weiss
t. Actual indexed content would be same if both index have > "status" field indexed so we only need to validate fieldnames per > document. Something like > > Thanks for reading all this if you have read so far :) > > Chetan Mehrotra > [1] > https://github.com/apach

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-03 Thread Dawid Weiss
> That helps and explains why there is no support in std api This isn't an API problem. This is by design -- this is how it works. If you wish to retrieve fields that are indexed and stored with the document, the API provides such an option (indexed and stored field type). Your indexed fields are

Re: testing with system properties

2018-08-09 Thread Dawid Weiss
Erick already pointed you at the "cleanup" rule. This is fairly generic, but if you know the properties being modified you should still clean them up in @After or @AfterClass -- this is useful for other people to know that you're modifying them, if for nothing else. Randomized testing package has
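The explicit cleanup recommended above boils down to "remember the old value, set the new one, restore in a finally block". The sketch below shows that pattern in plain Java with try/finally instead of JUnit's @After/@AfterClass or the randomized testing rule, so it stays self-contained; the property key used in `main` is a made-up name.

```java
// Hypothetical illustration class; in a real test the restore step would live in @After.
class SystemPropertyRestoreSketch {

    /** Sets a system property, runs a stand-in "test body", then restores the previous state. */
    public static String runWithProperty(String key, String value) {
        String previous = System.getProperty(key);
        System.setProperty(key, value);
        try {
            // Stand-in for the code under test that reads the property.
            return System.getProperty(key);
        } finally {
            if (previous == null) {
                System.clearProperty(key); // property did not exist before
            } else {
                System.setProperty(key, previous);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithProperty("sketch.demo.flag", "on"));
        System.out.println(System.getProperty("sketch.demo.flag"));
    }
}
```

Keeping the restore next to the modification also documents, for other readers of the test, which properties it touches.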

Re: RamDirectory vs MemoryIndex vs MMapDirectory for In-Memory-Index

2018-09-25 Thread Dawid Weiss
Use MMapDirectory on a temporary location, Matthias. If you really need in-memory indexes, a new Directory implementation is coming (RAMDirectory will be deprecated, then removed), but the difference compared to MMapDirectory is typically not worth the hassle. See this issue for more discussion. h

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Dawid Weiss
We are faced with a similar situation. Yes, the merge process can take a long time and is mostly single-threaded (if you're merging from N segments into a single segment, only one thread does the job). As Erick pointed out, the merge process takes a backseat compared to indexing and searches (in mo

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Dawid Weiss
> int processors = Runtime.getRuntime().availableProcessors(); > ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler(); > cms.setMaxMergesAndThreads(processors,processors); See, the number of threads in the CMS only matters if you have concurrent merges of independent segments. What you

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Dawid Weiss
Thanks for chipping in, Toke. A ~1TB index is impressive. Back of the envelope says reading & writing 900GB in 8 hours is 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interface for our SSD machine, but even with SATA II this is only ~1/5th of the possible fairly sequential IO throughput. So f
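The back-of-the-envelope figure above can be reproduced with a few lines of Java. The sketch assumes the index is read once and written once during the merge (hence the factor of 2) and uses 1 GB = 1024 MB, which is the convention that yields the quoted 64MB/s; the class and method names are made up.

```java
// Hypothetical illustration class for the arithmetic only.
class MergeThroughputSketch {

    /** Average MB/s needed to read and write an index of the given size within the given time. */
    public static double mbPerSecond(double indexGb, double hours) {
        double megabytesMoved = 2 * indexGb * 1024; // read + write, 1 GB = 1024 MB
        double seconds = hours * 60 * 60;
        return megabytesMoved / seconds;
    }

    public static void main(String[] args) {
        System.out.println(mbPerSecond(900, 8));
    }
}
```

With decimal units (1 GB = 1000 MB) the same calculation gives 62.5 MB/s, so the conclusion about being far below sequential IO limits does not depend on the convention.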

Re: Static index, fastest way to do forceMerge

2018-11-30 Thread Dawid Weiss
/jira/browse/LUCENE-8580 Dawid On Fri, Nov 2, 2018 at 10:17 PM Dawid Weiss wrote: > > Thanks for chipping in, Toke. A ~1TB index is impressive. > > Back of the envelope says reading & writing 900GB in 8 hours is > 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interfa

Re: RAMDirectory or Redis

2018-12-02 Thread Dawid Weiss
bq. We switched to ByteBuffersDirectory with 7.5, but I actually didn't see much performance improvements or savings in memory. Once the indexes are built I don't think there will be much of a difference. The core problem with RAMDirectory was related to synchronizations during merges/ file manipu

Re: RamUsageCrawler

2018-12-06 Thread Dawid Weiss
> It's entirely possible it fails to dig into Maps correctly with newer Java > releases; maybe Dawid or Uwe would know? We have removed all reflection from that class a while ago exactly because of encapsulation issues introduced in newer Java versions. https://github.com/apache/lucene-solr/blob/

Re: RamUsageCrawler

2018-12-06 Thread Dawid Weiss
ashMap by simply counting the size of the > Node that is used for each entry, although given the dynamic nature of > these data structures (HashMap eg can use TreeNodes sometimes > depending on data distribution) it would be almost impossible to be > 100% accurate. > On Thu, Dec 6, 2018

Re: Static index, fastest way to do forceMerge

2018-12-18 Thread Dawid Weiss
k on it. > > Regards, > Jerven > On 11/30/18 12:01 PM, Dawid Weiss wrote: > > Just FYI: I implemented a quick and dirty PoC to see what it'd work > > like. Not much of a difference on my machine (since postings merging > > dominates everything else). Interesting prob

Re: Slowness on Java 11 with Lucene 6

2019-07-30 Thread Dawid Weiss
> We have chosen G1GC for both Java 8 and Java 11 versions. It's not like we have answers for everything. ;) If it's the same GC on both and there is still a slowdown then something else may be causing it -- hard to tell without doing trial-and-error. There is a set of performance benchmarks; perh
