Re: Twitter analyser
This is a parts-of-speech analyzer for tweets. It would make your index far more useful. http://www.ark.cs.cmu.edu/TweetNLP/ On 11/04/2013 11:40 PM, Stéphane Nicoll wrote: Hi, I am building an application that indexes tweets and offers some basic search facilities on them. I am trying to find a combination where the following would work: * foo matches the word foo, the mention (@foo), or the hashtag (#foo) * @foo matches only the mention * #foo matches only the hashtag It should match complete words, so I used the WhitespaceAnalyzer for indexing. Any recommendation for this use case? Thanks! S. Sent from my iPhone
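One way to get these matching rules is sketched below, under assumptions: a WhitespaceTokenizer upstream, Lucene 4.x attribute APIs, and a made-up class name. The filter re-emits each mention or hashtag as a bare word at the same position (position increment 0), so foo matches all three forms while @foo and #foo stay exact. Use it at index time only, and analyze queries with plain whitespace tokenization so @foo and #foo queries keep their sigil.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class MentionHashtagFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private State pending; // saved state of a token whose bare form is still owed

  public MentionHashtagFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      // Re-emit the previous @foo/#foo token as "foo" at the same position.
      restoreState(pending);
      pending = null;
      termAtt.copyBuffer(termAtt.buffer(), 1, termAtt.length() - 1); // strip the sigil
      posIncrAtt.setPositionIncrement(0); // stack on top of the original token
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > 1 && (termAtt.charAt(0) == '@' || termAtt.charAt(0) == '#')) {
      pending = captureState(); // remember to emit the bare word on the next call
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}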
Re: JLemmaGen project
This is very cool! Lemmatization is an important tool for making search work better. Would you consider changing the licensing to the Apache 2.0 license? On 10/23/2013 08:17 AM, Michal Hlavac wrote: Hi, I rewrote the lemmatizer project LemmaGen (http://lemmatise.ijs.si/) in Java. Originally it's written in C#. The LemmaGen project uses rules to lemmatize words. The algorithm is described here: http://lemmatise.ijs.si/Download/File/Documentation%23JournalPaper.pdf The project is written under GPLv3. Sources are located on the Bitbucket server: https://bitbucket.org/hlavki/jlemmagen There is also a Lemmagen4j project, which uses more memory and ships without prebuilt trees. I also obtained licensed dictionaries to build rule trees for 15 languages. The dictionaries are licensed, but the prebuilt trees are not. You can also build your own dictionary. The project also contains a TokenFilter for Lucene/Solr. The project is not stable yet, but any feedback is appreciated. Supported languages are: mlteast-bg - Bulgarian mlteast-cs - Czech mlteast-en - English mlteast-et - Estonian mlteast-fr - French mlteast-hu - Hungarian mlteast-mk - Macedonian mlteast-pl - Polish mlteast-ro - Romanian mlteast-ru - Russian mlteast-sk - Slovak mlteast-sl - Slovene mlteast-sr - Serbian mlteast-uk - Ukrainian thanks, miso
Re: posting list strings
Is there a Trie-based term index? It seems like this would be smaller, and very fast on non-leading wildcards. On 07/09/2013 02:34 PM, Uwe Schindler wrote: Hi, You can replace the term by its hash directly in the analyzer chain. Just write a custom TermToBytesRef attribute that hashes the term to a constant-length byte[] (using an AttributeFactory)! :-) This would give you all the features of hashed, constant-length terms, but you would lose prefix and wildcard queries. In fact, NumericTokenStream does this for numerics! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Adrien Grand [mailto:jpou...@gmail.com] Sent: Tuesday, July 09, 2013 11:25 PM To: java-user@lucene.apache.org Subject: Re: posting list strings Hi, Lucene stores the string because it may need it to run prefix or range queries. We don't have a hash-based terms dictionary right now, but I know some people wrote one since they don't need support for these queries; see for instance the Earlybird paper[1]. Then if you can find a perfect hashing function, you can just replace your terms by their hash. [1] http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf -- Adrien
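For a rough feel of the trade-off, here is a sketch; this is not Uwe's attribute-level trick, just a simpler approximation at the TokenFilter level, and the class name is invented. Each term's text is replaced with a fixed-length hash before indexing, so terms become constant length and prefix/wildcard queries stop working, exactly as described above.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class TermHashFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public TermHashFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // A 32-bit hash rendered as 8 hex chars. A real deployment would need a
    // much longer (ideally perfect) hash to make collisions acceptable.
    int h = 0;
    for (int i = 0; i < termAtt.length(); i++) {
      h = 31 * h + termAtt.charAt(i);
    }
    termAtt.setEmpty().append(String.format("%08x", h));
    return true;
  }
}

The same filter must be applied at query time too, so only exact-term queries survive.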
Re: In memory index (current status in Lucene)
My current open source project is a Directory that is just like RAMDirectory, but everything is memory-mapped. The idea is that it creates a disk file, opens it, and immediately deletes the file. The file still exists until the IndexReader/Writer/Searcher closes it, but it cannot be found from the file system. This is just like a RAMDirectory, but without memory limitations. It's proving to be harder than it looked. The application is to store encrypted indexes in memory, with the decrypted contents in this non-findable format. I'm in medical document analysis now, and we can't store anything on disk in the clear. Lance On 07/01/2013 07:07 AM, Emmanuel Espina wrote: Hi Erick! Nice to hear from you again! From time to time my interest in these Lucene things returns and I do some experiments :p Just to add to this conversation, I found an interesting link to Mike's blog about memory-resident indexes (using another virtual machine) http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html and also (which is not exactly what I asked but seems related) there is a Google Summer of Code project to build a memory-resident term dictionary: http://www.google-melange.com/gsoc/project/google/gsoc2013/billybob/42001 Thanks Emmanuel 2013/7/1 Erick Erickson erickerick...@gmail.com: Hey Emma! It's been a while Building on what Steven said, here's Uwe's blog on MMapDirectory and Lucene: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html I've always considered RAMDirectory for rather restricted use-cases, i.e. if I know without doubt that the index is both relatively static and bounded. The other use I've seen is to index single documents on-the-fly for some reason (say complex processing of a single result) and then throw the index out afterwards. How are things going? Erick On Fri, Jun 28, 2013 at 5:36 PM, Steven Schlansker ste...@likeness.com wrote: On Jun 28, 2013, at 2:29 PM, Emmanuel Espina espinaemman...@gmail.com wrote: I'm building a distributed index (mostly as a research project for school) and I'm evaluating indexing the entire collection in memory (like Google, Facebook and others have done years ago). The obvious reason for this is performance, considering that the replication will give me a reasonably good durability of the data (despite being in volatile memory). What is the current status of Lucene for this kind of index? RAMDirectory in its documentation has a scary warning that says it is not intended to work with huge indexes, and that sounds more like an implementation for testing rather than something for production. Of course there is no real context for this question, because it is a research topic. Testing its limits would be the closest to a context I have :p You could consider MMapDirectory, which will end up putting the active portions of the index in memory (via the filesystem buffer cache). The benefit is that you don't completely destroy the Java heap (RAMDirectory causes immense GC pressure if you are not careful) and you don't have to commit all of your RAM to index usage all the time. The downside is that if your working set exceeds the amount of RAM available for buffer cache, you will get silent performance degradation as you fall back to disk reads for the missing blocks. Maybe this is OK for your use case, maybe not.
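A minimal sketch of the open-then-unlink trick, assuming POSIX semantics (the bytes survive until the last reference goes away; Windows behaves differently) and with invented names. This is not the actual project code.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class AnonymousMappedBuffer {
  public static MappedByteBuffer create(long size) throws IOException {
    File f = File.createTempFile("hidden-index", ".bin");
    RandomAccessFile raf = new RandomAccessFile(f, "rw");
    try {
      raf.setLength(size);
      // Unlink immediately: the mapping stays valid, but the file is no
      // longer reachable from the file system.
      if (!f.delete()) {
        throw new IOException("could not unlink " + f);
      }
      return raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);
    } finally {
      raf.close(); // an established mapping does not depend on the channel
    }
  }
}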
Re: Content based recommender using lucene/solr
Solr/Lucene has two features for this: 1) the MoreLikeThis code, and 2) the clustering project in solr/contrib. Lance On 06/28/2013 11:15 AM, Luis Carlos Guerrero Covo wrote: I only have about a million docs right now, so scaling is not a big issue. I'm looking to provide a quick implementation and then worry about scale when I get around to implementing a more robust recommender. I'm looking at a content-based approach because we are not tracking users and items viewed by users. I was thinking of using MoreLikeThis like Walter mentioned, but wanted some feedback on the nuances required for a proper implementation, like having a similarity based on euclidean distance, normalizing numerical field values, and computing collection-wide stats like mean and variance. Thank you for the link Otis, I will watch it right away. On Fri, Jun 28, 2013 at 1:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, It doesn't have to be one or the other. In the past I've built a news recommender engine based on CF (Mahout) and combined it with a content-similarity-based engine (it wasn't Solr/Lucene, but something custom that worked with ngrams, though it may as well have been Lucene/Solr/ES). It worked well. If you haven't worked with Mahout before, I'd suggest the approach in that video, going from there to Mahout only if it's limiting. See Ted's stuff on this topic, too: http://www.slideshare.net/tdunning/search-as-recommendation + http://berlinbuzzwords.de/sessions/multi-modal-recommendation-algorithms (note: Mahout, Solr, Pig) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Jun 28, 2013 at 2:07 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: You could build a custom recommender in Mahout to accomplish this. Also, just out of curiosity, why the content-based approach as opposed to building a recommender based on co-occurrence? One other thing: what is your data size? Are you looking at scale where you need something like Hadoop? From: lcguerreroc...@gmail.com Date: Fri, 28 Jun 2013 13:02:00 -0500 Subject: Re: Content based recommender using lucene/solr To: solr-u...@lucene.apache.org CC: java-user@lucene.apache.org Hey Saikat, thanks for your suggestion. I've looked into Mahout and other alternatives for computing k nearest neighbors. I would have to run a job to compute the k nearest neighbors and track them in the index for retrieval. I wanted to see if this was something I could do with Lucene, using Lucene's scoring function and Solr's MoreLikeThis component. The job you specifically mention is for item-based recommendation, which would require me to track the different items users have viewed. I'm looking for a content-based approach where I would use a distance measure to establish how near (how similar) items are, and have some kind of training phase to adjust weights. On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Why not just use Mahout to do this? There is an item similarity algorithm in Mahout that does exactly this :) https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html You can use Mahout in distributed and non-distributed mode as well. From: lcguerreroc...@gmail.com Date: Fri, 28 Jun 2013 12:16:57 -0500 Subject: Content based recommender using lucene/solr To: solr-u...@lucene.apache.org; java-user@lucene.apache.org Hi, I'm using Lucene and Solr right now in a production environment with an index of about a million docs.
I'm working on a recommender that would list the n most similar items to the user based on the current item he is viewing. I've been thinking of using Solr/Lucene since I already have all docs available and I want a quick version that can be deployed while we work on a more robust recommender. How about overriding the default similarity so that it scores documents based on the euclidean distance of normalized item attributes, and then using a MoreLikeThis component to pass in the attributes of the item for which I want to generate recommendations? I know it has its issues, like recomputing scores/normalization/weight application at query time, which could make this idea unfeasible/impractical. I'm at a very preliminary stage right now with this and would love some suggestions from experienced users. thank you, Luis Guerrero -- Luis Carlos Guerrero Covo M.S. Computer Engineering (57) 3183542047
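A minimal sketch of the MoreLikeThis route, using stock Lucene scoring rather than the euclidean-distance similarity discussed above. The field names are invented, and the package is the 3.x contrib location (4.x moved it to org.apache.lucene.queries.mlt).

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;

public class ItemRecommender {
  public static TopDocs recommend(IndexReader reader, IndexSearcher searcher,
                                  int docId, int n) throws IOException {
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"title", "description", "attributes"}); // hypothetical fields
    mlt.setMinTermFreq(1); // item descriptions are short
    mlt.setMinDocFreq(2);  // ignore terms unique to a single item
    Query like = mlt.like(docId);        // seed with the item being viewed
    return searcher.search(like, n + 1); // +1 because the seed item matches itself
  }
}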
Please add me as a wiki editor
I'm responsible for the OpenNLP wiki page: https://wiki.apache.org/solr/OpenNLP Please add me to the list of editors.
Re: Taking backup of a Lucene index
The simple answer (that somehow nobody gave) is that you can make a copy of an index directory at any time. Indexes are changed in generations. The segment* files describe the current generation of files. All active indexing goes on in new files. In a commit, all new files are flushed to disk and then the segment* files change. At any point in this sequence, all of the files in the directory form one consistent index. This isn't like MySQL or other databases, where you have to shut down the DB to get a safe copy of the files. Lance On 04/17/2013 03:57 AM, Ashish Sarna wrote: I want to take a backup of a Lucene index. I need to ensure that the index files will not change while I take their backup. I am concerned about the housekeeping/merge/optimization activities which Lucene performs internally. I am not sure when/how these activities are performed by Lucene and how we can prevent them. My application (which allows indexing and searching over the created indexes) keeps running in the background. I can ensure that nothing is written to the indexes by my application when I take their backup, but I am not sure whether indexes would change in some manner when a search is performed over them. How can I ensure that an index will not change (i.e., no housekeeping/merge/optimization activity is performed by Lucene) while I take its backup? Any help would be much appreciated. PS: Currently I am using Lucene 2.9.4 but wish to upgrade to 3.6.2. Regards Ashish
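For a stronger guarantee than copy-and-hope, Lucene also has SnapshotDeletionPolicy, which pins a commit so merges cannot delete its files mid-copy. A sketch, assuming the id-based API of later 3.x releases (earlier versions use snapshot()/release() without an id argument):

import java.util.Collection;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;

public class BackupHelper {
  // Pass this policy to the IndexWriter when it is opened
  // (IndexWriterConfig.setIndexDeletionPolicy in 3.1+).
  private final SnapshotDeletionPolicy snapshotter =
      new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());

  public SnapshotDeletionPolicy policy() {
    return snapshotter;
  }

  public void backup() throws Exception {
    IndexCommit commit = snapshotter.snapshot("backup"); // pin this commit
    try {
      Collection<String> files = commit.getFileNames();
      // copy each file named in 'files' out of the index directory here
    } finally {
      snapshotter.release("backup"); // let merges reclaim the pinned files
    }
  }
}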
Re: Zero-position query?
Thanks! Now, to hunt for this in the parsers. On 06/02/2013 09:16 PM, Israel Tsadok wrote: You can do this with a PhraseQuery[1]. Just add more terms with position 0. [1] http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/PhraseQuery.html#add(org.apache.lucene.index.Term, int) On Mon, Jun 3, 2013 at 6:46 AM, Lance Norskog goks...@gmail.com wrote: What is a Lucene query that will find two words at the same term position?
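A minimal example of the PhraseQuery trick (field and terms invented): both terms are added at relative position 0, so the query matches only where they occupy the same term position.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SamePositionQuery {
  public static PhraseQuery build(String field, String word, String synonym) {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term(field, word), 0);    // base word at relative position 0
    pq.add(new Term(field, synonym), 0); // synonym must sit at the same position
    return pq;
  }
}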
Zero-position query?
What is a Lucene query that will find two words at the same term position? Is there a class that will do this? Is the feature available from the Lucene query syntax or any other syntax parsers? For example, if I'm using synonyms at index time I should get the base word and all synonyms at the same position. What is a query that will find a document with the synonym substituted, but will not find a document which has the base word and a synonym at two different positions? Thanks, Lance.
Re: StandardAnalyzer: Support for Japanese
The 3.x and 4.0 Solr releases have nice analyzers just for Japanese; in 4.0 they are the Kuromoji package. In 4.0, the JapaneseAnalyzer probably does what you need: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-kuromoji/4.0.0/org/apache/lucene/analysis/ja/JapaneseAnalyzer.java?av=f 3.6 also has the Kuromoji package, but I don't know how advanced it is compared to the 4.x version. Cheers! On 01/10/2013 11:19 AM, saisantoshi wrote: We are using StandardAnalyzer for indexing some Japanese keywords. It works fine so far, but I just wanted to confirm whether the StandardAnalyzer can fully support it (I have read somewhere in the Lucene in Action book that StandardAnalyzer does support CJK). I just want to confirm that my understanding is correct, or do we need to use a specific analyzer for processing Japanese keywords? Alternatively, is there a stop-word list for the Japanese language so that we can add an extra filter to the StandardAnalyzer? Any thoughts on this are much appreciated. Thanks, Sai.
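Minimal usage, assuming Lucene 4.0: the Version-only constructor loads the default Japanese stopwords and stoptags, so no extra stop filter is needed for the common case.

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.util.Version;

public class JapaneseAnalyzerExample {
  public static JapaneseAnalyzer create() {
    // Ships with default Japanese stopwords/stoptags and Kuromoji segmentation.
    return new JapaneseAnalyzer(Version.LUCENE_40);
  }
}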
Re: potential memory leak when using RAMDirectory ,CloseableThreadLocal and a thread pool .
There were memory leak problems with earlier versions of Java. You should upgrade to Java 6_30. Lance On 01/02/2013 05:26 AM, Alon Muchnick wrote: Hello All, we are using Lucene 3.6.2 in our web application on Tomcat 5.5, and recently we started testing our application on Tomcat 7. Unfortunately we seem to encounter a memory leak in Lucene's CloseableThreadLocal class; any help with solving the issue below would be much appreciated. We are using RAMDirectory for our indexes. While testing the application on Tomcat 7 we noticed that there is a memory leak in our application. After taking a heap dump we can see the memory leak is in the following reference chain: Type | Name | Value ref | index | org.apache.lucene.store.RAMDirectory ref | core | org.apache.lucene.index.SegmentReader$CoreReaders ref | tis | org.apache.lucene.index.TermInfosReader ref | threadResources | org.apache.lucene.util.CloseableThreadLocal ref | hardRefs | java.util.HashMap @ 0x9d566938 I guess the HashMap is used for caching purposes, and it holds entries where the key is a thread name and the value is a org.apache.lucene.index.TermInfosReader$ThreadResources object. *Even when I stop new incoming connections to the application, Tomcat closes all the active threads and a GC is run, the above map size is not reduced and GC cannot reclaim the heap space.* The problem looks somewhat similar to LUCENE-3841 https://issues.apache.org/jira/browse/LUCENE-3841 but we are not using SnowballAnalyzer. (I checked the code and made sure the hardRefs map is a WeakHashMap.) Our JVM is: OpenJDK 64-Bit Server VM, java.runtime.version 1.6.0_20-b20. Once again, any help would be much appreciated. thanks Alon
Re: Pulling lucene 4.1
4.x does not promise backwards compatibility with 3.x. Have you made your own extensions? On 01/02/2013 04:38 AM, Shai Erera wrote: There's no specific branch for 4.1 yet. All development still happens on the 4x branch (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/). Note that Lucene maintains two active branches for development: 'trunk' (currently to be 5.0) and '4x', off of which all Lucene 4.x releases are created. Shai On Wed, Jan 2, 2013 at 11:57 AM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote: Dear all, I would be glad to know on which branch of Lucene the development of version 4.1 is happening. I would be glad if you can share the repo URL; we are testing out certain features of 4.1, including CompressingStoredFieldsFormat. Currently we are pulling from trunk, which I guess is the 5.x branch. We are very particular about 4.1 because we need backward compatibility with 3.x. Thanks in advance. -- With Thanks and Regards, Ramprakash Ramamoorthy, India.
Re: Which token filter can combine 2 terms into 1?
How do you choose t2 and t2a? If you have a full inventory of these pairs, you can make them multi-word synonyms and use the synonym filter to combine them. On 12/20/2012 11:50 PM, Xi Shen wrote: Hi, I am looking for a token filter that can combine 2 terms into 1. E.g. the input has been tokenized by white space: t1 t2 t2a t3 I want a filter that outputs: t1 t2t2a t3 I know it is a very special case, and I am thinking about developing a filter of my own. But I cannot figure out which API I should use to look for terms in a TokenStream.
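A sketch of the multi-word-synonym route, assuming the FST-based SynonymFilter from recent Lucene versions and a known pair list; here 't2 t2a' is rewritten to the single token 't2t2a'.

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class PairCombiner {
  public static TokenStream wrap(TokenStream input) throws IOException {
    SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
    builder.add(SynonymMap.Builder.join(new String[] {"t2", "t2a"}, new CharsRef()),
        new CharsRef("t2t2a"),
        false); // false: do not keep the original two tokens
    return new SynonymFilter(input, builder.build(), true); // ignoreCase
  }
}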
Re: how to implement a TokenFilter?
Go to the top directory and do this: cp dev-tools/eclipse/dot.project .project cp dev-tools/eclipse/dot.classpath .classpath cp -r dev-tools/eclipse/dot.settings .settings The 'ant eclipse' target does this setup. On 12/24/2012 10:45 PM, Xi Shen wrote: Hi Lance, I got the Lucene 4 source from http://mirror.bjtu.edu.cn/apache/lucene/java/4.0.0/lucene-4.0.0-src.tgz; it is an Ant project. But I do not know which IDE can import it... I tried Eclipse, and it cannot import the build.xml file. Thanks, D.
Re: how to implement a TokenFilter?
You need to use an IDE. Find the Attribute type and show all subclasses. This shows a lot of rare ones and a few which are used a lot. Now, look at the source code for various TokenFilters and search for other uses of the Attributes you find. This is generally how I figured it out. Also, after the full Analyzer stack is called, the caller saves the output (I guess to codecs?). You can look at which Attributes it saves. On 12/23/2012 06:30 PM, Xi Shen wrote: thanks a lot :) On Mon, Dec 24, 2012 at 10:22 AM, feng lu amuseme...@gmail.com wrote: hi Shen Maybe you can look at some source code in the org.apache.lucene.analysis package, such as LowerCaseFilter.java, StopFilter.java and so on. Some common attributes include: offsetAtt = addAttribute(OffsetAttribute.class); termAtt = addAttribute(CharTermAttribute.class); typeAtt = addAttribute(TypeAttribute.class); Regards On Sun, Dec 23, 2012 at 4:01 PM, Rafał Kuć r@solr.pl wrote: Hello! The simplest way is to look at the Lucene javadoc and see what implementations of the Attribute interface there are - http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/Attribute.html -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch thanks, i read this already. it is useful, but it is too 'small'... e.g. for this.charTermAttr = addAttribute(CharTermAttribute.class); i want to know what other attributes i need in order to implement my function. where can i find a reference for these attributes? i tried the lucene solr wiki, but all i found is a list of the names of these attributes, nothing about what they are capable of... On Sat, Dec 22, 2012 at 10:37 PM, Rafał Kuć r@solr.pl wrote: Hello! A small example with some explanation can be found here: http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/ -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hi, I need a guide to implement my own TokenFilter. I checked the wiki, but I could not find any useful guide :( -- Don't Grow Old, Grow Up... :-)
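Pulling the thread's advice together, a minimal TokenFilter skeleton: declare the attributes you need as fields via addAttribute(), then modify them in incrementToken(). Lowercasing stands in here for whatever transformation you actually want.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MyTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public MyTokenFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream
    }
    char[] buffer = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      buffer[i] = Character.toLowerCase(buffer[i]); // transform the term in place
    }
    return true;
  }
}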
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Parts-of-speech tagging is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does parts-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene. https://issues.apache.org/jira/browse/lucene-2899 On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote: Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed? Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time? For example: - part of speech of a token. - syntactic parse subtree (over a span). - semantically normalized phrase (to canonical text or ontological code). - semantic group (of a span). - coreference link. stephen
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
I should not have added that note. The OpenNLP patch gives a concrete example of adding an annotation to text. On 12/13/2012 01:54 PM, Glen Newton wrote: It is not clear this is exactly what is needed/being discussed. From the issue: We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. This adds it to a token, not a span. 'Same position' does not suggest it also records the end position. -Glen
Re: Which stemmer?
Nope! This slang term only exists in the plural. The kind of prose with this usage may not follow standard grammatical and spelling rules anyway. Historically, text search has been funded mostly by the US intelligence agencies, because they want to analyze formal and technical prose. And it is coded by people who think in good grammar and are perfect spellers. If you find 'too aggressive' and 'too mild' to be a problem, what you want is 'lemmatization', where you work from a dictionary of word forms. Solr supports using WordNet for this purpose. Lance - Original Message - | From: Igal @ getRailo.org i...@getrailo.org | To: java-user@lucene.apache.org | Sent: Friday, November 16, 2012 4:18:20 PM | Subject: Re: Which stemmer? | | but if dogs are feet (and I guess I fall into the not-perfect group | here)... and feet is the plural form of foot, then shouldn't dogs | be stemmed to dog as a base, singular form? | | On 11/16/2012 2:32 PM, Tom Burton-West wrote: | Hi Mike, | | Honestly I've never heard of anyone using dogs to mean feet either, but | hey, nobody's perfect. | | This is really off topic but I couldn't resist. This usage of dogs to | mean feet occurs in old blues lyrics such as Blind Lemon Jefferson's Hot | Dogs | http://www.youtube.com/watch?v=v670qVwzm9c | (It's hard to make out what he's singing on the old 78, but he says his dogs | is red hot, meaning he can run really fast.) | http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/ | | Tom
Re: Lucene 4.0 delete by ID
Scott, did you mean the Lucene integer id, or the unique id field? - Original Message - | From: Martijn v Groningen martijn.v.gronin...@gmail.com | To: java-user@lucene.apache.org | Sent: Sunday, October 28, 2012 2:24:29 PM | Subject: Re: Lucene 4.0 delete by ID | | A top-level document ID can change over time. For that reason you | shouldn't rely on it. However, if you know your index is stable, or you | keep track of when a merge happens, you can use the | IndexWriter#tryDeleteDocument method to delete a document by Lucene | id. Deleting a document via an IndexReader is no longer possible. | | Martijn | | On 27 October 2012 01:47, Mossaab Bagdouri | bagdouri_moss...@yahoo.fr wrote: | Lucene document IDs are not stable. You could add a field with an | ID that you maintain. Your query would then be just a TermQuery on the ID. | | Regards, | Mossaab | | 2012/10/26 Scott Smith ssm...@mainstreamdata.com | | I'm currently converting some Lucene code to 4.0. It appears that | you are no longer allowed to delete a document by its ID. Is that | correct? Is my only option to figure out some kind of query (which | obviously isn't based on ID) and do the delete from there? | | -- | Met vriendelijke groet, | | Martijn van Groningen
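Mossaab's suggestion in code (the 'id' field name is assumed; it must be indexed as a single un-tokenized term):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeleteById {
  public static void delete(IndexWriter writer, String uniqueId) throws IOException {
    writer.deleteDocuments(new Term("id", uniqueId)); // deletes every doc carrying this id
  }
}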
Re: A large number of files in an index (3.6)
An option: instead of merging continuously as you run, you can optimize with 'maxSegments=10'. This means 'optimize, but only until there are 10 segments'. If there are fewer than 10 segments, nothing happens. This lets you schedule merging I/O. Is the number of files a problem due to file space breakage? - Original Message - | From: kiwi clive kiwi_cl...@yahoo.com | To: java-user@lucene.apache.org | Sent: Saturday, October 27, 2012 12:44:34 PM | Subject: A large number of files in an index (3.6) | | Hi guys, | | I've recently moved from Lucene 2.3 to 3.6. The application uses CFS | format. With Lucene 2.3, I understood the interaction of merge | factor etc. with respect to how many files were created in the index | directory. With a merge factor of 10, the number of files in the | index directory could sometimes get up to 30, but you could see the | merging happen and the number of files would roll up after a while | and settle around 10-15. | | With Lucene 3.6, this is not the case. Firstly, even with the MergePolicy | set to use CFS, the index appears to be a hybrid of CFS and raw index | format. I can understand that may have been done for performance | reasons, but it does increase the file count considerably. Also, the | rollup of the merged segments is not occurring as it did in the | previous version. Originally I set the CFSRatio to 1.0 and found | the behaviour similar to Lucene 2.3 (file-number wise), but this came | at an I/O cost and the machines ran with a higher load average. The | higher I/O starts to affect query performance. Reducing cfsRatio to | 0.1 (the default) helped reduce I/O load, but I am running several | thousand concurrent indexes across many disks on the servers, and | the larger number of files per index means a large number of files | are being opened when a query hits the index, in addition to the | indexing load. | | I'm sure this is probably down to merge policies and schedules, but | there are quite a few knobs to tweak here, so some guidance as to the | most beneficial parameters to tweak would be very helpful. | | I'm using the LogByteSizeMergePolicy with 3 background merge threads. | I'm considering using TieredMergePolicy and even reducing the number | of merge threads, but there is not much point if it does not roll up | the segments as expected. I can tweak the cfsRatio, but this | strikes me as a large hammer and there may be more subtle ways to do | this! | | So tell me I'm being stupid, just say 'derr - why don't you do | this' and I'll be a happy man!! | | Thanks, | Clive
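In the 3.5+ API this partial optimize is forceMerge(int); a sketch:

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;

public class ScheduledMerge {
  public static void mergeDownTo(IndexWriter writer, int maxSegments) throws IOException {
    // Merges until at most maxSegments remain; a no-op if already at or below it.
    // Call this from a scheduled job to keep merge I/O in off-peak windows.
    writer.forceMerge(maxSegments);
  }
}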
Re: Efficient string lookup using Lucene
The WhitespaceAnalyzer breaks up text by spaces, tabs and newlines. After that, you can use wildcards. This will use very little space. I believe leading/trailing wildcards are supported now, right? On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin izavo...@caci.com wrote: The user uploads a set of text files, either all of them at once or one at a time, and then they will be searched locally on the phone against a set of hotlist words. This assumes no connection to any sort of server, so everything must be done locally. I already have Lucene integrated, so I might want to try the n-gram approach. But I just want to double-check first that it will work with any Unicode string, be it an English word, a foreign word, a sequence of digits, or any random sequence of Unicode characters. In other words, this is not in any way language-dependent/-specific. Thanks, Ilya -Original Message- From: Dawid Weiss [mailto:dawid.we...@gmail.com] Sent: Sunday, August 26, 2012 3:55 AM To: java-user@lucene.apache.org Subject: Re: Efficient string lookup using Lucene Does Lucene support this type of structure, or do I need to somehow implement it outside Lucene? You'd have to implement it separately, but it'd be much, much smaller than Lucene itself (even obfuscated). By the way, I need this to run on an Android phone, so the size of memory might be an issue... How large is your input? Do you need to index on the Android device or just read the index on it? These are all factors to take into account. I mentioned suffix trees and suffix arrays because these two are canonical data structures for performing any substring lookup in constant time (in fact, the lookup takes time proportional to the length of the matched input string; building the suffix tree/array is O(n), at least in theory). If you already have Lucene integrated in your pipeline then that n-gram approach will also work. If you know your minimum match substring length to be p, then index p-sized shingles. For strings longer than p you can create a query which will search for all n-gram occurrences and take into account positional information to remove false matches. Dawid -- Lance Norskog goks...@gmail.com
Re: easy way to figure out most common tokens?
You don't need to index the data. Just run the analyzer and maintain your own counters. This will be disk-bound and will run at your disk reading speed. On Sun, Aug 19, 2012 at 5:17 PM, Shaya Potter spot...@gmail.com wrote: On 08/19/2012 08:07 PM, Shaya Potter wrote: On 08/15/2012 02:34 PM, Ahmet Arslan wrote: Is there an easy way to figure out the most common tokens and then remove those tokens from the documents? Probably this: http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html I'm unsure how to use this; as far as I can tell, org.apache.lucene.misc.TermStats doesn't exist in Lucene 3.6.1 (there seems to be some class like that in 4.x, but that doesn't help me). I'm wrong, it's there, but Eclipse isn't seeing it (I haven't tried javac by itself), even though it sees HighFreqTerms just fine. -- Lance Norskog goks...@gmail.com
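A sketch of the run-the-analyzer-yourself approach; nothing is written to an index, and the field name passed to the analyzer is a placeholder.

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenCounter {
  private final Map<String, Long> counts = new HashMap<String, Long>();

  public void count(Analyzer analyzer, String text) throws IOException {
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String term = termAtt.toString();
      Long c = counts.get(term);
      counts.put(term, c == null ? 1L : c + 1); // bump the per-term counter
    }
    ts.end();
    ts.close();
  }

  public Map<String, Long> counts() {
    return counts; // sort by value afterwards to find the most common tokens
  }
}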
Re: RAM or SSD...
You do not want to store 30 GB of data in the JVM heap, no matter what library does it. MMapDirectory does not store data in the JVM heap; it lets the operating system manage the disk buffer space. Even if the JVM says it has 30 GB of memory space, it really does not. It only has address space allocated by the OS, but no memory. On Wed, Jul 18, 2012 at 10:39 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2012-07-18 at 17:50 +0200, Dragon Fly wrote: If I want to improve performance, which of the following is better and why? 1. Buy a machine with a lot of RAM and use a RAMDirectory for the index. As others have pointed out, MMapDirectory should work better than RAMDirectory. I am sure it will work fine with a relatively small index such as yours. However, it does not scale that well with index size. 2. Put the index on a solid state drive. Why anyone buys computers without SSDs is a mystery to me. Use SSDs for the small low-latency stuff and a secondary spinning drive for the large slow stuff. Nowadays, a 30GB index (or 100GB for that matter) falls into the small low-latency bucket. SSDs speed up almost everything, save RAM and spare a lot of work hours spent optimizing I/O speed. Regards, Toke Eskildsen -- Lance Norskog goks...@gmail.com
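Minimal usage of the OS-managed alternative (the index path is invented):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class OpenIndex {
  public static Directory open() throws IOException {
    // The OS page cache holds the hot parts of the index; the JVM heap stays small.
    return new MMapDirectory(new File("/indexes/products"));
  }
}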
Re: Direct memory footprint of NIOFSDirectory
You can choose another directory implementation. On Thu, Jul 12, 2012 at 1:42 PM, Vitaly Funstein vfunst...@gmail.com wrote: Just thought I'd bump this. To clarify - for reasons outside my control, I can't just run the JVM hosting the Lucene-enabled application with -XX:MaxDirectMemorySize=100G or some other huge value for the ceiling and never worry about this. Due to preallocation and other restrictions, this parameter has to be fairly close to the actual size used by the app (padded for Lucene and possibly other consumers). On Mon, Jul 9, 2012 at 7:59 PM, Vitaly Funstein vfunst...@gmail.com wrote: Hello, I have recently run into a situation where there was not a sufficient amount of direct memory available for IndexWriter to work. This was essentially caused by the embedding application making heavy use of the JVM's direct memory buffers and not leaving enough headroom for NIOFSDirectory to operate. So what are the approximate guidelines, if any, in terms of JVM configuration for this choice of Directory to operate safely? Basically, what I am looking for is a rough estimate of direct memory usage per GB of indexed data, or per directory/writer instance, if applicable. Thanks, -V -- Lance Norskog goks...@gmail.com
Re: RAMDirectory with FSDirectory merging Versus large mergeFactor and RAMBufferSizeMB
RAMDirectory is no longer an interesting technique for this. It makes garbage collection do a lot of work. With a memory-mapped directory the data is cached by the OS instead of Java, and the OS is very good at this. TieredMergePolicy is much smarter about time spent merging segments. Lucene in Action 2 might be more help than a 6-year-old book :) On Mon, Jun 4, 2012 at 12:47 AM, Maxim Terletsky sx...@yahoo.com wrote: Hi guys, There are two approaches I see in Lucene in Action to speeding up the indexing process. 1) Simply increase the mergeFactor and RAMBufferSizeMB. 2) Use RAMDirectory as a buffer (perhaps even several in parallel) and later merge it into an FSDirectory using addIndexes. So my question is the following: in case I have only 1 thread with RAMDirectory, is that pretty much the same as method 1? Since it's in memory anyhow for a large mergeFactor and large RAMBufferSizeMB. Maxim -- Lance Norskog goks...@gmail.com
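Option 1 in current API terms, as a sketch (the 256 MB figure is illustrative, not a recommendation):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class FastIndexingConfig {
  public static IndexWriterConfig build(Analyzer analyzer) {
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    iwc.setRAMBufferSizeMB(256.0); // buffer more docs in RAM before flushing a segment
    iwc.setMergePolicy(new TieredMergePolicy()); // smarter merge selection
    return iwc;
  }
}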
Re: lucene (search) performance tuning
Can you use filter queries? Filters short-circuit a lot of search processing. City:San Francisco is a classic filter - it matches a small part of the documents and it is reused a lot. On Sat, May 26, 2012 at 7:32 AM, Yang tedd...@gmail.com wrote: I'm using a disjunction (OR) query. Unfortunately all of the clauses are optional. On Sat, May 26, 2012 at 4:38 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Sat, May 26, 2012 at 2:59 AM, Yang tedd...@gmail.com wrote: I tested with more threads / processes. Indeed this is completely CPU-bound, since running 1 thread gives the same latency as 4 threads (my box has 4 cores). Given this, is there any way to simplify the scoring computation (I'm only using Lucene as a first-level rough search, so the search quality is not a huge issue here), so that, for example, fewer fields are evaluated or a simpler scoring function is used? are you using disjunction or conjunction queries? Can you make some parts of the query mandatory? simon thanks Yang On Fri, May 25, 2012 at 5:47 PM, Yang tedd...@gmail.com wrote: thanks a lot guys On Tue, May 22, 2012 at 1:34 AM, Ian Lea ian@gmail.com wrote: Lots of good tips in http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, linked from the FAQ. -- Ian. On Tue, May 22, 2012 at 2:08 AM, Li Li fancye...@gmail.com wrote: Something went wrong when writing from my Android client. If RAMDirectory does not help, I think the bottleneck is CPU. You may try to tune the JVM, but I do not expect much improvement. The best option is splitting your index into 2 or more smaller ones; you can then use Solr's distributed searching. If the CPU is not fully used, you can do this on one physical machine. On 2012-5-22 8:50 AM, Li Li fancye...@gmail.com wrote: On 2012-5-22 4:59 AM, Yang tedd...@gmail.com wrote: I'm trying to make my search faster. Right now a query like name:Joe Moe Pizza address:77 main street city:San Francisco - is this a conjunction query or a disjunction query? - on an index with 20mil such short business descriptions (total size about 3GB) takes about 100--200ms. 20m is not a small size; how many results for a query on average? I profiled the query; most time is spent in TermScorer.score(), as is shown by the attached YourKit screenshot. That's true - for a query, matching and scoring is very time-consuming and CPU-intensive. Another cost is I/O for reading postings. I tried loading the index onto tmpfs (an in-memory block device), and also tried RAMDirectory; neither helps much. If that is true, it seems that I/O is not the bottleneck. I am reading http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf - it mentions: Size – Stopword removal – Stemming • Lucene has a number of stemmers available • Light versus Aggressive • May prevent fine-grained matches in some cases – Not a linear factor (usually) due to index compression. So for stopword removal, I'm already using the standard analyzer, so stop-word removal is already included, right? Also, generally, any other tricks to try for reducing the search latency? Thanks!
Yang -- Lance Norskog goks...@gmail.com
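A sketch of the filter suggestion (field and value invented): the constant clause is wrapped in a cached filter, so it is computed once and skips scoring on later queries.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class FilteredSearch {
  // Build once and reuse across queries so the cache actually helps.
  // Assumes "city" is indexed as a single un-analyzed term.
  static final Filter CITY_FILTER = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("city", "san francisco"))));

  public static TopDocs search(IndexSearcher searcher, Query q) throws Exception {
    return searcher.search(q, CITY_FILTER, 10); // filtered-out docs are never scored
  }
}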
Re: lucene (search) performance tuning
And no, RAMDirectory does not help. On Mon, May 28, 2012 at 5:54 PM, Lance Norskog goks...@gmail.com wrote: Can you use filter queries? Filters short-circuit a lot of search processing. City:San Francisco is a classic filter - it matches a small part of the documents and it is reused a lot. -- Lance Norskog goks...@gmail.com
Re: Sort runs out of memory
The Trie type can be tuned for range queries vs. single queries. This seems to be explained in email and nowhere else: http://www.lucidimagination.com/search/document/c501f59515a9eece On Mon, May 21, 2012 at 12:54 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Thu, 2012-05-17 at 23:03 +0200, Robert Bart wrote: I am running Lucene 3.6 in a system that indexes about 4 billion documents across several indexes, and I'm hoping to get documents in order of a certain NumericField. What is the maximum size of any single index, in terms of number of documents? What is the type of the NumericField? I've tried using Lucene's Sort implementation, but it looks like it tries to do the entire sort in memory by allocating a huge array with space for every document in the index. The FieldCache allocates an array of length #documents with the same type as your NumericField. The sort itself is of the sliding-window type, meaning that it only takes up memory linear in the number of documents wanted in the response. Do you require millions of documents to be returned as part of a search? Sanity check: you do specify the type when performing a sorted search, right? If not, the values will be treated as Strings. On my index, this quickly runs out of memory. Assuming that your largest index is 1B documents and that your NumericField is of type Integer, the FieldCache's values for the sort should take up 1B * 4 bytes = 4GB. Are you hoping for less? Are there any alternatives or better ways of getting documents in order of a NumericField for a very large index? Be sure to select the type of NumericField to be as small as possible. If you have few unique sort values (e.g. 17, 80, 2000 and 5678), you might map them down (to 0, 1, 2 and 3 for this example) and store them as a byte. Currently Lucene only supports atomic types for numerics in the FieldCache, so the smallest one is byte. It is possible to use only ceil(log2(#unique_values)) bits/document, although that requires a bit of custom coding. Regards, Toke Eskildsen -- Lance Norskog goks...@gmail.com
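Toke's sanity check in code: give SortField the numeric type you indexed; otherwise the values are loaded as Strings and the FieldCache balloons.

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class TypedSort {
  public static Sort byIntField(String field) {
    // 4 bytes per document in the FieldCache, instead of a String per value.
    return new Sort(new SortField(field, SortField.INT));
  }
}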
Clear/Remove attribute from Token
I would like to remove a payload attribute from a token before it is indexed. PayloadAttribute lets you set the payload to null. AttributeSource (parent of all Tokens) does not have a 'removeAttribute' method. You cannot capture the current attribute set with 'captureState()' and then monkey with it (at least Eclipse does not show me its methods). If I set the payload to null, when the Token is saved in the index, will a null payload be saved? Or does the payload get quietly dropped? -- Lance Norskog goks...@gmail.com
Re: Clear/Remove attribute from Token
With more hunting, I found the code for this in org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(int, int). The next question is: does a Token need a PositionIncrementAttribute to be written out? Or can I just tack on a Payload and that is it? Does it need an Offset also? -- Lance Norskog goks...@gmail.com
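My reading of the 3.x indexing chain (an observation, not documented behavior) is that a null payload simply writes no payload bytes for that position, so nulling it in a TokenFilter is effectively removal. A minimal sketch:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public final class StripPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public StripPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(null); // a null payload stores no payload bytes
    return true;
  }
}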
Re: Here a merge thread, there a merge thread ...
Solr uses TieredMergePolicy by default now. You might find it works more smoothly. On Fri, Feb 24, 2012 at 10:03 AM, Benson Margulies bimargul...@gmail.com wrote: On Fri, Feb 24, 2012 at 10:59 AM, Michael McCandless luc...@mikemccandless.com wrote: This is from ConcurrentMergeScheduler (the default MergeScheduler). But are you sure the threads are sleeping, not exiting? (They should be exiting.) This merge scheduler starts a new thread when a merge is needed, allows that thread to do another merge (if one is immediately available), else the thread exits. They seem to exit eventually, but not quite as soon as they arrive. Mike McCandless http://blog.mikemccandless.com On Sun, Feb 19, 2012 at 9:05 PM, Benson Margulies bimargul...@gmail.com wrote: A long-running program of mine (which Uwe's read a model of) slowly keeps adding merge threads. I count 22 at the moment. Each one shows up, runs for a bit, and then goes to sleep for, seemingly, ever. I don't do anything explicit to control merging behavior. They name themselves Lucene Merge Thread #xxx, where xxx is a non-contiguous but ever-growing number. -- Lance Norskog goks...@gmail.com
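A sketch of taming the merge machinery, with all values illustrative: switch the policy and cap the scheduler's thread count.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class MergeConfig {
  public static IndexWriterConfig build(Analyzer analyzer) {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setSegmentsPerTier(10.0); // how many similar-sized segments to tolerate
    tmp.setMaxMergeAtOnce(10);    // segments merged in a single operation
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxThreadCount(3);     // hard cap on concurrent merge threads
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    iwc.setMergePolicy(tmp);
    iwc.setMergeScheduler(cms);
    return iwc;
  }
}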
Re: Retrieving large numbers of documents from several disks in parallel
Is each index optimized? From my vague grasp of Lucene file formats, I think you want to sort the documents by segment document id, which is the order of documents on the disk. This lets you materialize documents in their order on the disk. Solr (and other apps) generally use a separate thread per task and separate index reading classes (not sure which any more). As to the cold-start, how many terms are there? You are loading them into the field cache, right? Solr has a feature called auto-warming which automatically runs common queries each time it reopens an index. On Wed, Dec 21, 2011 at 11:11 PM, Paul Libbrecht p...@hoplahup.net wrote: Michael, from a physical point of view, it would seem like the order in which the documents are read is very significant for the reading speed (feel the random access jump as being the issue). You could: - move to ram-disk or ssd to make a difference? - use something different than a searcher which might be doing it better (pure speculation: does a hit-collector make a difference?) hope it helps. paul Le 22 déc. 2011 à 03:45, Robert Bart a écrit : Hi All, I am running Lucene 3.4 in an application that indexes about 1 billion factual assertions (Documents) from the web over four separate disks, so that each disk has a separate index of about 250 million documents. The Documents are relatively small, less than 1KB each. These indexes provide data to our web demo (http://openie.cs.washington.edu), where a typical search needs to retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user. In the worst case, a new, uncached query takes around 30 seconds to complete, with all four disks IO bottlenecked during most of this time. My implementation uses a separate Thread per disk to (1) call IndexSearcher.search(Query query, Filter filter, int n) and (2) process the Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like a long time to retrieve 3,000 small Documents, I am wondering if I am overlooking something simple somewhere. Is there a better method for retrieving documents in bulk? Is there a better way of parallelizing indexes from separate disks than to use a MultiReader (which doesn’t seem to parallelize the task of materializing Documents) Any other suggestions? I have tried some of the basic ideas on the Lucene wiki, such as leaving the IndexSearcher open for the life of the process (a servlet). Any help would be greatly appreciated! Rob - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
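A minimal sketch of that docid-ordered materialization, run by each per-disk thread (Lucene 3.x; the usual imports and a per-index searcher are assumed):

    // Collect the matching ids first, then load stored fields in ascending
    // docid order, which approximates on-disk order and avoids random seeks.
    TopDocs top = searcher.search(query, filter, 3000);
    int[] ids = new int[top.scoreDocs.length];
    for (int i = 0; i < ids.length; i++) ids[i] = top.scoreDocs[i].doc;
    Arrays.sort(ids);
    for (int id : ids) {
      Document doc = searcher.doc(id); // materialize in disk order
      // ... copy out the fields needed for the results page ...
    }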
Re: semanticvectors
It's kind of a bazooka, but the Mahout project has support for this. https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation On Tue, Aug 30, 2011 at 6:24 AM, zarrinkalam f_z...@yahoo.com wrote: Dear paul, did you use semanticvectors? I couldn't find appropriate help. zarrinkalam From: Paul Libbrecht p...@hoplahup.net To: java-user@lucene.apache.org Sent: Monday, August 29, 2011 7:28 PM Subject: Re: LSI Zarrinkalam, have a look at semanticvectors. paul On 29 August 2011 at 15:55, zarrinkalam wrote: hi, I want to use LSI for clustering documents indexed with lucene. I don't know how; please help me. thanks, - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Lance Norskog goks...@gmail.com
Re: RAMDirectory doesn't win over FSDirectory all the time, why?
The RAMDirectory uses Java memory, an FSDirectory does not. Holding Java memory makes garbage collection work harder. The operating system is very very good at managing disk buffers, and does a better job using spare memory than Java does. For real-world sites, RAMDirectory is almost always useless. Maybe the Instantiated index stuff is more what you want? Lance On Tue, Jun 7, 2011 at 2:52 AM, zhoucheng2008 zhoucheng2...@gmail.com wrote: Makes sense. Thanks -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Tuesday, June 07, 2011 4:28 PM To: java-user@lucene.apache.org Subject: Re: RAMDirectory doesn't win over FSDirectory all the time, why? On Mon, 2011-06-06 at 15:29 +0200, zhoucheng2008 wrote: I read the lucene in action book and just tested the FSversusRAMDirectoryTest.java with the following uncommented: [...] Here is the output: RAMDirectory Time: 805 ms FSDirectory Time : 728 ms This is the code, right? http://java.codefetch.com/example/in/LuceneInAction/src/lia/indexing/FSversusRAMDirectoryTest.java The test is problematic as the same two tests run sequentially. If you change

    long ramTiming = timeIndexWriter(ramDir);
    long fsTiming = timeIndexWriter(fsDir);

to

    long fsTiming = timeIndexWriter(fsDir);
    long ramTiming = timeIndexWriter(ramDir);

my guess is that RAMDirectory will be faster. For a better comparison, perform each test in separate runs (make a test class just for RAMDirectory and one just for FSDirectory, then run them one at a time, each in its own JVM). One big problem when comparing RAMDirectory to file-access is caching. What you measure with a test might not be what you see in production, as the production index might be large compared to RAM available for file caching. -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Solr 1.4.1: Weird query results
Look at the 'text' field definition's analysis stack. Does it have the same analyzer and filters that you used to make the index, and in the same order? The specific problem is that the text field includes a stemmer, and your code probably did not. And so 'marine' is stored as, maybe, 'marin'. To check this out, look at the 'schema browser' page off the admin page. This will show you all of the indexed terms in each field. Also look at the Analysis page: this lets you see how text is parsed and changed in the analysis stack. On Tue, Apr 19, 2011 at 2:56 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, I don't see the problem either. It *sounds* like you don't really have the default search field defined the way you think you do. Did you restart Solr after making that change? I'm assuming that when you say not created by Solr you mean that it's created by Lucene. What version of Lucene and Solr are you using if that's true? You can test this by appending debugQuery=on to your query or checking the debug enable checkbox in the full query interface from the admin page. That should show you exactly what is being searched. You might also want to look at the analysis page for your field and see how your query is tokenized. But, like I said, this looks like it should work. If you can post the results of adding debugQuery=on, your actual fieldType definition for text_ws, your field declaration for text, and the defaultSearchField from your schema, that would help. I can't tell you how many times something that's eluded me for hours is obvious to someone else :).. Best Erick On Tue, Apr 19, 2011 at 3:59 PM, Erik Fäßler erik.faess...@uni-jena.de wrote: Hallo there, my issue qualifies as newbie question I guess, but I'm really a bit confused. I have an index which has not been created by Solr. Perhaps that's already the point although I fail to see why this should be an issue with my problem. I use the admin interface to check which results particular queries bring in. My index documents have a field text which holds the document text. This text has only been white space tokenized. So in my schema, the type for this field is text_ws. My schema says <defaultSearchField>text</defaultSearchField>. When I now search for, say, 'marine' (without quotes), I don't get any search results. But when I search "marine" (that is, embraced by double quotes) I get my document hits. Alternatively, I can prepend the field name: 'text:marine' and will also get my results. Similar with this phrase query: "marine mussels", where "In marine mussels of the genus" is a text snippet of a document. The phrase "marine mussels" won't give any hits. Searching for 'text:"marine mussels"' will give me the exact document containing this text snippet. I'm sure this has quite a simple explanation but I'm unable to find it right now ;-) Perhaps you can help with that. Thanks a lot! Best regards, Erik -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
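For example (core URL hypothetical), appending the debug parameter shows the parsed query and the per-document scoring: http://localhost:8983/solr/select?q=marine&debugQuery=on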
Re: Help!
Check out the Mahout project: mahout.apache.org - there is a lucene-based text classifier project in there. Lance On Tue, Mar 1, 2011 at 9:25 PM, Sundus Hassan sundushas...@gmail.com wrote: I am doing MS-Thesis on content-based text categorization. For This purpose I intend to use LUCENE.I need some help/tutorial/guide regarding: 1) How to build and deploy LUCENE? 2) Some basic information regarding working of Lucene? 3) How to use LUCENE in my project? Will be looking forward for response. Thanks in advance. -- Regards, Sundus Hassan - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: 3.0.3 Contrib Query Parser : Custom Field Name Builder
Bravo! On Fri, Jan 7, 2011 at 10:39 PM, Adriano Crestani adrianocrest...@gmail.com wrote: I created a JIRA to fix this problem: https://issues.apache.org/jira/browse/LUCENE-2855 On Sat, Jan 8, 2011 at 1:32 AM, Adriano Crestani adrianocrest...@gmail.com wrote: Hi Christopher, Thanks for raising this problem, I always thought it a little bit strange to use CharSequence as a map key. Then I just did a little bit of research and found this in the CharSequence javadoc: This interface does not refine the general contracts of the equals (http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#equals(java.lang.Object)) and hashCode (http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#hashCode()) methods. The result of comparing two objects that implement CharSequence is therefore, in general, undefined. Each object may be implemented by a different class, and there is no guarantee that each class will be capable of testing its instances for equality with those of the other. It is therefore inappropriate to use arbitrary CharSequence instances as elements in a set or as keys in a map. So I think every Set or Map that uses CharSequence on contrib queryparser should be forced to use String instead. I think there is no need to change any API, we just need to make sure that toString() is invoked on the CharSequence object before adding it to any Set or Map, this way we can fix this problem for the next 3.x release. However, for 4.x, we should ideally change every API that receives or returns Map<CharSequence,...> or Set<CharSequence> to use only String. On Fri, Jan 7, 2011 at 8:44 PM, Christopher St John ckstj...@gmail.com wrote: I'm trying to: StandardQueryTreeBuilder b = …; b.setBuilder("myfield", fieldSpecificBuilder); In the debugger I see that the builder is registered in the QueryTreeBuilder's fieldNameBuilders map. When parsing, QueryTreeBuilder.getBuilder tries to look up the builder by using the FieldableNode's field but the debugger says the node's field is an UnescapedCharSequence, not a String, and the lookup fails. Registering the builder with an UnescapedCharSequence for the name instead of a String doesn't seem to help, presumably because UCS doesn't have an equals or hashCode method. Suggestions? I've worked around it by registering a class-based builder, checking for the field name and either delegating to the original builder or doing my custom processing, but it's a little awkward. -cks -- Christopher St. John http://artofsystems.blogspot.com -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
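A minimal sketch of that interim fix, normalizing to String before the map is touched (Lucene 3.x contrib queryparser; the variable names are hypothetical):

    // CharSequence implementations need not agree on equals()/hashCode(),
    // so convert to String before any Map lookup or insertion.
    Map<String, QueryBuilder> fieldNameBuilders = new HashMap<String, QueryBuilder>();
    fieldNameBuilders.put("myfield", fieldSpecificBuilder);
    CharSequence field = fieldNode.getField(); // may be an UnescapedCharSequence
    QueryBuilder builder = fieldNameBuilders.get(field.toString());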
Re: Using Lucene/Solr for Plagiarism detection
The MoreLikeThis feature may be exactly what you want. Try it out. On Thu, Dec 30, 2010 at 8:28 AM, Amel Fraisse amel.frai...@gmail.com wrote: Hello, No, I'm not using cosine similarity metrics. 2010/12/30 Shashi Kant sk...@sloan.mit.edu Have you considered using document similarity metrics such as Cosine Similarity? On Thu, Dec 30, 2010 at 6:05 AM, Amel Fraisse amel.frai...@gmail.com wrote: Hello, I am using Lucene for plagiarism detection. The goal is that: when I have a new document, I will check in the solr index if there is a document that contains some common chunks. So to compute similarity between the query and a source document I would use this formula: Score(suspicious document, source document) = number of common chunks between source document and suspicious document / number of total chunks in the suspicious document. So I have to change the scoring formula in the Similarity class. How can I change the scoring formula? (By customizing only the Similarity class? Or the Scorer?) Do you have an example of this use case? Thanks for your help. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Amel Fraisse -- Lance Norskog goks...@gmail.com
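One rough way to approximate that formula in Lucene is to neutralize tf, idf and the norms and keep only the coordination factor, which is the fraction of query terms matched. A sketch, assuming each chunk is indexed and queried as a single term (an approximation of the formula above, not a drop-in implementation):

    public class ChunkOverlapSimilarity extends DefaultSimilarity {
      // score is dominated by (matching chunks) / (total chunks in the query)
      @Override public float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
      }
      @Override public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
      @Override public float idf(int docFreq, int numDocs) { return 1.0f; }
      @Override public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
      @Override public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    }

    // install it on the searcher:
    // searcher.setSimilarity(new ChunkOverlapSimilarity());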
Re: Using Lucene to search live, being-edited documents
Check out the Instantiated contrib for Lucene. This is an alternative in-memory data structure that does not need commits and is faster (and larger) than the Lucene Directory system. On Wed, Dec 29, 2010 at 9:15 AM, adam.salt...@gmail.com wrote: What has this to do with Lucene? You're thinking its index would be faster than your own search algorithm. Would it though? Do you really need an index or a good pattern matcher? I can't see what the stream buffer being flushed by the user has to do with it? Don't you have to control that behaviour? Sent using BlackBerry® from Orange -Original Message- From: software visualization softwarevisualizat...@gmail.com Date: Wed, 29 Dec 2010 11:55:17 To: java-user@lucene.apache.org; adam.salt...@gmail.com Reply-To: softwarevisualizat...@gmail.com Subject: Re: Using Lucene to search live, being-edited documents I am writing a text editor and have to provide a certain search functionality . The use case is for single user. A single document is potentially very large and numerous such documents may be open and unflushed at any given time. Think many files of an IDE, except the files are larger. The user is free to change, say, variables names across documents which may be separate files opened simultaneously in a variety of tabs (say) and being edited with no guarantee that the user has flushed or saved any of it. On Wed, Dec 29, 2010 at 10:37 AM, adam.salt...@gmail.com wrote: This is interesting. What are we driving at here? A single user? That doesn't make sense to unless you want to flag certain things as they construct the document. Or else why don't they know what is in their own document? There must be other ways apart from Lucene. It seems to me you want each line parsed as soon as entered and matched against some criteria. I would look at plugins for Open Office first. Or any other text editor. But not sure you have given enough information. Sent using BlackBerry® from Orange -Original Message- From: Sean spaceh...@foxmail.com Date: Wed, 29 Dec 2010 15:35:17 To: java-userjava-user@lucene.apache.org Reply-To: java-user@lucene.apache.org Subject: Re:Using Lucene to search live, being-edited documents Does it make any sense? Every time a search result is shown, the original document could have been changed, no matter how fast the indexing speed is. If you can accept this inconsistency, you do not need to index so frequently at all. -- Original -- From: software visualizationsoftwarevisualizat...@gmail.com; Date: Wed, Dec 29, 2010 06:06 AM To: java-userjava-user@lucene.apache.org; Subject: Using Lucene to search live, being-edited documents This has probably been asked before but I couldn't find it, so... Is it possible / advisable / practical to use Lucene as the basis of a live document search capability? By live document I mean a largish document such as a word processor might be able to handle which is being edited currently. Examples would be Word documents of some size that are begin written, really huge Java files, etc. The user is sitting there typing away and of course everything is changing in real time. This seems to be orthogonal to the idea of a Lucene index which is costly to construct and costly to update. TIA -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: asking about index verification tools
The Lucene CheckIndex program does this. It is org.apache.lucene.index.CheckIndex, a class with a main() method. Samarendra Pratap wrote: It is not guaranteed that every term will be indexed. There is a limit on the maximum number of terms per field (as in lucene 3.0 and maybe earlier too). Check out this http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int) On Tue, Nov 16, 2010 at 11:36 AM, Yakob jacob...@opensuse-id.org wrote: hello all, I would like to ask about the lucene index. I mean, I created a simple program that created lucene indexes and stored them in a folder. also I had used a diagnostic tool named Luke to be able to lurk inside the lucene index and find out its content. and I know that lucene is a standard framework when it comes to building a search engine. but I just wanted to be sure that lucene indexes every term that existed in a file. I mean, is there a way for me or some tools out there to verify that what the lucene index contains is dependable? and not a single term went missing there? I know that this is a subjective question but I just wanted to hear your two cents. thanks though. :-) tl;dr: how can we know that the index in lucene is correct? -- http://jacobian.web.id - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
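It can also be run from the command line against an index directory (jar version and path hypothetical); adding -fix drops unreadable segments, so use that flag with care:

    java -cp lucene-core-3.0.2.jar org.apache.lucene.index.CheckIndex /path/to/index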
Re: What is the best Analyzer and Parser for this type of question?
First, to understand what your query looks like, go to admin/analysis.jsp. It lets you see what happens to your queries when they go in. Then, do the query with debugQuery=true. This will add some complex junk to the end of the XML page that describes in painful detail exactly how each document was scored. After all that - you might have a problem with the PrnP etc. stuff getting chopped up in weird ways. I don't know how people handle this in chemistry/bio search. Lance Ahmet Arslan wrote: Example of Question: - What is the role of PrnP in mad cow disease? First thing is do not directly query questions. Manually formulate queries: remove 'what' 'is' 'the' 'of' '?' etc. For example I would convert this question into: "mad cow"^5 "cow disease"^3 "mad cow disease"^15 "role PrnP"~5^2 "role mad cow disease"~45 mad^0.1 role^0.5 "cow disease PrnP"^10 I am running it on 11,638 documents and the result is 10,410 docs for this question (low precision) Use OR as the default operator, collect and evaluate the top 1000 documents only. And instead of Porter you can try KStem. http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi Try the different length normalizations described here. Also their Lucene query example (SpanNear) can inspire you. http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Can I use Lucene for this?
The Lucene MoreLikeThis tool in lucene/contrib/similar will do one variant of what you want. You can do this particular test in Solr - you'll find it much much easier to put together. For other text similarities, you'll have to code them directly. Lance On Sat, Nov 13, 2010 at 7:07 AM, Shashi Kant sk...@sloan.mit.edu wrote: There are multiple measures of similarity for documents: Cosine similarity is a frequently used one. On Sat, Nov 13, 2010 at 9:23 AM, Ciprian URSU ursu@gmail.com wrote: Hi Guys, I just found out about Lucene; after reading the main things on the wiki it seems to be a great tool, but I still didn't find out how I can use it for my needs. What I want to do is a small tool which has some documents (mainly text) inside, and then when I have a new document as input, to compare it with all those which are stored and to give me back a percentage of similarity. I have read this part: http://wiki.apache.org/lucene-java/ScoresAsPercentages but it is not yet very clear to me how to use Lucene for that. Is it possible that some of you have sample code for that? Thanks a lot, and I apologize for the fact that for many of you this looks like a stupid post :). Best Regards, Ciprian. -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
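A minimal sketch of MoreLikeThis for the compare-a-new-document case (Lucene 3.x contrib; field name and path are hypothetical; the scores are relative, so turning them into percentages is up to you, as the wiki page warns):

    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "text" });
    Query query = mlt.like(new StringReader(newDocumentText)); // the incoming document
    TopDocs similar = new IndexSearcher(reader).search(query, 10);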
Re: How to handle more than Integer.MAX_VALUE documents?
You would have to control your MergePolicy so it doesn't collapse everything back to one segment. On Tue, Nov 2, 2010 at 12:03 PM, Simon Willnauer simon.willna...@googlemail.com wrote: On Tue, Nov 2, 2010 at 1:58 AM, Lance Norskog goks...@gmail.com wrote: 2 billion is a hard limit. Usually people split indexes into multiple indexes long before this, and use the parallel multi reader (I think) to read from all of the sub-indexes. On Mon, Nov 1, 2010 at 2:16 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote: Hi, Now lucene uses integer as document id, so it means we cannot have more than 2^31-1 documents within one collection? Even if we use MultiSearcher the document id is still integer so it seems this is still a problem? This is really the limit of a segment, I think you can write your own collector and collect documents with higher (absolute) doc ids than INT_MAX. Yet, I think if you reach the limit of INT_MAX documents you should really rethink the way your search works and apply some sharding techniques. I really haven't been up to that many docs in a single index but I think it should work to have multiple segments with INT_MAX documents in it since we search on segment level, provided your collector supports it. simon We have been using lucene for some time and our document count is growing rather rapidly, maybe this is a much-discussed issue already, but I did not find the lead, any pointer would be really appreciated. Thanks very much for the help, Lisheng -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to handle more than Integer.MAX_VALUE documents?
2 billion is a hard limit. Usually people split indexes into multiple indexes long before this, and use the parallel multi reader (I think) to read from all of the sub-indexes. On Mon, Nov 1, 2010 at 2:16 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote: Hi, Now lucene uses integer as document id, so it means we cannot have more than 2^31-1 documents within one collection? Even if we use MultiSearcher the document id is still integer so it seems this is still a problem? We have been using lucene for some time and our document count is growing rather rapidly, maybe this is a much-discussed issue already, but I did not find the lead, any pointer would be really appreciated. Thanks very much for the help, Lisheng -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
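A minimal sketch of stitching shards together (Lucene 3.x; paths hypothetical). Note that the combined view still addresses documents with an int, so the total across sub-indexes must also stay under 2^31-1; past that point the shards have to be searched separately and the results merged by the application:

    IndexReader[] shards = new IndexReader[] {
      IndexReader.open(FSDirectory.open(new File("/idx/shard0"))),
      IndexReader.open(FSDirectory.open(new File("/idx/shard1")))
    };
    IndexSearcher searcher = new IndexSearcher(new MultiReader(shards));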
Re: Email Indexing
Tika has some mailbox file parsing that includes metadata extraction. For POP/IMAP email servers I don't know any tools. Hasan Diwan wrote: On 27 October 2010 18:16, Troy Wical t...@wical.com wrote: Depends on what you're trying to index, I suppose. Maildir or mbox? For some time now, off and on, I have been working to index an ezmlm mailing list archive. In the end, I went with Swish-E and have made quite a bit of progress. I am short of my complete goal though. The issue is that the search results do not return results that contain the subject, and there is currently no excerpt or phrase highlighting. My problem is the flat text email files I am working with have no xml or anything to help the indexer create fields from. I've not yet figured out how to convert the emails to xml. Neither Maildir nor mbox -- IMAP/POP doesn't care. Basically, I want to build the index based on the contents of (my) gmail box. I can retrieve the messages using IMAP, just need to figure out the structure of the index. Converting email to XML? Email me off-list and I'll provide you with some help (as email-to-XML conversion has little to do with lucene). - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Text categorization / classification
There are tools for this in the Mahout project. These are oriented toward large-scale work. http://mahout.apache.org There is a big learning curve and you have to learn Hadoop somewhat. The book 'Programming Collective Intelligence' includes a suite of Python tools for small-scale experiments. On Wed, Oct 27, 2010 at 1:12 PM, Maria Vazquez mvazq...@ova.st wrote: I need to auto-categorize a large number of documents. They are basically news articles from major news sources (nytimes, npr, abcnews, etc). I'd like to categorize them automatically. Any suggestions? Lucene in Action suggests using a set of documents to build category vectors and then comparing each document to each of those vectors and getting the closest one. The approach seems pretty simple (from other papers I read on text categorization) but maybe you guys know of something out there that already does this using Lucene/Solr. Thanks! Maria -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to export lucene index to a simple text file?
The Lucene CheckIndex program opens an index and walks all of the data structures. It is a good start for you. Sahin Buyrukbilen wrote: Thank you Uwe, I will read the docs and try to do it, however do you have an example code? I need it because I am not very familiar with Java. Thank you. Sahin On Tue, Sep 21, 2010 at 12:29 PM, Uwe Schindler u...@thetaphi.de wrote: Hi, Retrieve a TermEnum and iterate it. By that you get all terms and can retrieve the docFreq, which is the second column in your table. Finally, for each term you position the TermDocs enum on this term to get all document ids. Read the docs of IndexReader/TermEnum/TermDocs about this. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Sahin Buyrukbilen [mailto:sahin.buyrukbi...@gmail.com] Sent: Tuesday, September 21, 2010 9:12 AM To: java-user@lucene.apache.org Subject: How to export lucene index to a simple text file? Hi, I am currently working on a project about private information retrieval and I need to have an inverted index file in txt format as follows:

    Term t    freq(t)    Inverted list for t
    and       16, 0.159
    big       22, 0.148  3, 0.088
    dark      16, 0.079
    . . . .

here the <number1, number2> pairs are indicating: number1: doc ID, where term t exists with a rank of number2. I have created an index from 5492 txt files, however the index is composed of different files and most of the data is not in text format. Could somebody guide me to achieve this? Thank you Sahin. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
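A minimal sketch of the iteration Uwe describes (Lucene 3.x; path hypothetical). It prints each term, its docFreq, and the <docID, termFreq> pairs; computing your rank values from the term frequencies is left to you:

    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    TermEnum terms = reader.terms();
    TermDocs docs = reader.termDocs();
    while (terms.next()) {
      Term t = terms.term();
      System.out.print(t.field() + ":" + t.text() + "\t" + terms.docFreq() + "\t");
      docs.seek(t); // position the TermDocs enum on this term
      while (docs.next()) {
        System.out.print("<" + docs.doc() + ", " + docs.freq() + "> ");
      }
      System.out.println();
    }
    reader.close();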
Re: Checksum and transactional safety for lucene indexes
If an index file is not completely written to disk, it never becomes available. Lucene has a file describing the current active index segments. It writes all new files to the disk, and changes the description file (segments.gen) only after that. If the index files are corrupted, all bets are off. Usually the data structures are damaged and Lucene throws CorruptIndexExceptions, NPEs or array out-of-bounds exceptions. There is no checksumming of the index files. Lance Pulkit Singhal wrote: Hello Everyone, What happens if: a) the lucene index gets written half-way to the disk and then something goes wrong? b) the index gets corrupted on the file system? When we open that directory location again using FSDirectory implementations: a) Is there any provision for the code to clean out the previous file and start a new index file because the older one was corrupted and didn't match the checksum? b) Or can we check that the # of documents that can be found in the underlying index is now ZERO because they can't be parsed properly? How can we do this? - Pulkit - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Connection question
This can probably be done. The hardest part is cross-correlating your Lucene analyzer use with the Solr analyzer stack definition. There are a few things Lucene does that Solr doesn't- span queries for one. Lance On Fri, Sep 17, 2010 at 12:39 PM, Christopher Gross cogr...@gmail.com wrote: Yes, I'm asking about network connections. Are you aware of any documentation on how I can set up Solr to use the Lucene index that I already have? Thanks! -- Chris On Fri, Sep 17, 2010 at 3:02 PM, Ian Lea ian@gmail.com wrote: Are you asking about network connections? There is no networking built into lucene. There is in solr, and lucene can use directories on networked file systems. -- Ian. On Fri, Sep 17, 2010 at 6:08 PM, Christopher Gross cogr...@gmail.com wrote: I'm trying to connect to a Lucene index on a test server. All of the examples that I've found use a local directory to connect into the Lucene index, but I can't find one that will remotely hook into it. Can someone please point me in the right direction? I'm fairly certain that someone has run into and fixed this problem, but I haven't been able to find a way to do it. Thanks for your help! -- Chris - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Extra Analyzers
Please start a new thread instead of hijacking this one. 2010/9/10 Iam Jabour iamjab...@gmail.com: Hi, I got lucene from http://www.apache.org/dyn/closer.cgi/lucene/java/ but I'm looking for extra Analyzers like BrazilianAnalyzer [1] and others. Where can I get extra packages for lucene? Ty [1] - http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers/3.0.2/org/apache/lucene/analysis/br/BrazilianAnalyzer.java/ __ Iam Jabour On Thu, Sep 9, 2010 at 2:23 AM, fulin tang tangfu...@gmail.com wrote: we now have 0.15 billion documents, with a source size of 1.5 TB, on 16 shards. I am very interested in how you get your job done. [The dream begins its struggle at the edge of the city; the heart's far shore persists in the moment of each footstep; my fate has buried an eternity of loneliness.] 2010/8/26 Nigel nigelspl...@gmail.com: I'm curious about what the largest Lucene installations are, in terms of: - Greatest number of documents (i.e. X billion docs) - Largest data size (i.e. Y terabytes of indexes) - Most machines (i.e. Z shards or servers) Apart from general curiosity, the obvious follow-up question would be what approaches were taken to scale to extremes. We have ~11 billion documents indexed (growing at 2 billion per month), but I'm sure someone else has enough that this appears puny. (-: Thanks, Chris -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Sorting a Lucene index
It is also possible to sort by function. This allows you to avoid storing an array of one int per document. It is slower than the raw Lucene sort. On Wed, Aug 25, 2010 at 1:46 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2010-08-25 at 07:16 +0200, Shelly_Singh wrote: I have 1 bln documents to sort. So, that would mean (8 bln bytes == 8GB RAM). All I have is 8 GB on my machine, so I do not think this approach would work. This implies that your numeric value can be more than 2 billion. Are you sure that is true? First suggestion (simple): Ensure that your sort field is stored and sort by requesting the value for each document in the search result. This works okay when the number of hits is small. Second suggestion (complex): Make an int-array with the sort-order of your documents. This takes 4GB and needs to be calculated fully before use, which will take time. After that sorted searches will be very fast and handle a large number of hits well. You can let your indexer maintain the sort-array so that the existing order can be re-used when adding documents. Whether modifying an existing order-array is cheaper than a full re-sort or not depends on your batch size. Regards, Toke Eskildsen -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene applicability
A stepping stone to the above is that, in DB terms, a Lucene index is only one table. It has a suite of indexing features that are very different from database search. The features are oriented to searching large bodies of text for ideas rather than concrete words. It searches a lot faster than a DB. It also spends more time creating its various indexes than a DB. Other points - you can't add or drop fields or indexes. On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson erickerick...@gmail.com wrote: The SOLR wiki has lots of good information, start there: http://wiki.apache.org/solr/ Otherwise, see below... On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang wolfgang.schrei...@itsv.at wrote: Hi all, We are currently evaluating potential search frameworks (such as Hibernate Search) which might be suitable to use in our project (using Spring, JPA with Hibernate) ... I am sending this E-Mail in hope you can advise me on a few issues that would help us in our decision making process. 1.) Is Lucene suitable for full text database searches? I read Lucene was designed to index and search documents but how does it behave querying relational data sets in general? Let's start by talking about the phrase full text database searches. One thing virtually all db-centric people trip over is trying to use SOLR as if it were a database. You just can't think about tables. The first time you think about using SOLR to do something join-like, stop and take a deep breath and think about documents instead. The general approach is to flatten your data so that each document contains all the relevant info. Yes, this leads to de-normalization. Yes, denormalized data makes a good DBA cringe. But that's the difference between searching and using a RDBMS. Document is somewhat misleading. A document in SOLR terms is just a collection of fields. And, BTW, there's no requirement that each document have the same fields (very unlike a DB). 2.) Can we make assumptions on query performance considering combined searches, range queries or structured data and wildcard searches? If we consider a data structure consisting of say 3 tables and each table contains a few million entries (e.g. first name, last name and address fields) and we search for common values (such as 'John', 'Smith' and 'New York') where a. each value for itself and each combination would result in millions of hits Sure, but what those assumptions are is totally dependent on how you've set things up. SOLR has been successfully used on several billion document indexes. There are tools for making all that work (i.e. replication, sharding, etc) built into SOLR. So I suspect you can make things work. Several million documents is not that large a data set. As always, there are tradeoffs between speed and complexity. But from what you've described I see no show stoppers. b. a person can have multiple first names and we want to make sure to receive any combination of the last name with any first name This just sounds like an OR. But the queries can be pretty complex queries. Some examples of what you expect would help. See multi-valued fields. So, a document can have multiple firstname entries. Again, not like a DB (your reflexes will trip you up on this point <g>). c. we search for a last name and a range of birth dates Sure, range queries work just fine. Note that dates can trip you up, look at triedate if you experiment. 3.) Transaction safety: How does Lucene handle indexes? 
If we update the data model and index, what happens to the index if anything goes wrong as soon as the data model has been persisted? A lot of work has been done to make SOLR quite robust if anything goes wrong. That said, how are you backing up your data? That is, what is the source of the data you're going to index? If you're relying on your SOLR index to be your backup, you simply must back it up somewhere often enough to get by if your building burns down. I'd also think about storing your original input... This is no different than a DB: you have to guard against the disk crashing, someone walking by with a powerful magnet, earthquake, flood, fires <g>. Do note that if you modify your index schema, no existing documents reflect the new schema, you have to reindex them. I hope I made the issues clear to you, just some general thoughts about how Lucene would behave in a real world application scenario ... Any support or pointers to helpful documents or Web links are highly appreciated! Cheers for now, w -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Solr SynonymFilter in Lucene analyzer
Yes, you need an analyzer that leaves successive words together as one long term. This might be easier to do with the new CharFilter tool, which processes text before it goes to the tokenizer. What you are doing here is similar to Parts-Of-Speech analysis, where text analysis software parses a sentence and labels words 'Noun', 'Verb', etc. One suite stores these labels as payloads on the terms. This might be a better way to store your categories, rather than using the synonym filter. On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan arunrangara...@gmail.com wrote: I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter is the one that prevents multi-word synonyms like New York from getting mapped to the generic synonym name like CONCEPTcity. It appears to me that an analyzer which recognizes that a white-space is inside a synonym like New York will be required. Do I need to implement one like this or is there already an analyzer I can use? Looks like I am missing something here, since Solr's SynonymFilter is supposed to handle this. Can someone tell me what is the correct way to integrate Solr's SynonymFilter within a custom lucene analyzer? Thanks. On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan arunrangara...@gmail.com wrote: I am trying to have multi-word synonyms work in lucene using Solr's SynonymFilter. I need to match synonyms at index time, since many of the synonym lists are huge. Actually they are really not synonyms, but are words that belong to a concept. For example, I would like to map {New York, Los Angeles, New Orleans, Salt Lake City...}, a bunch of city names, to the concept called city. While searching, the user query for the concept city will be translated to a keyword like, say CONCEPTcity, which is the synonym for any city name. Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131), all I could match for CONCEPTcity is single-word city names like Chicago, Seattle, Boston, etc. It would not match multi-word city names like New York, Los Angeles, etc. I tried using Solr's SynonymFilter in the tokenStream method in a custom Analyzer (that extends org.apache.lucene.analysis.Analyzer - lucene ver. 2.9.3) using:

    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new SynonymFilter(new WhitespaceTokenizer(reader), synonymMap);
      return result;
    }

where synonymMap is loaded with synonyms using synonymMap.add(conceptTerms, listOfTokens, true, true); where conceptTerms is of type ArrayList<String> with all the terms in a concept and listOfTokens is of type List<Token> and contains only the generic synonym identifier like CONCEPTcity. When I print synonymMap using synonymMap.toString(), I get output like {New York={Chicago={Seattle={New Orleans=[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null}}}} so it looks like all the synonyms are loaded. But if I search for CONCEPTcity then it says no matches found. I am not sure whether I have loaded the synonyms correctly in the synonymMap. Any help will be deeply appreciated. Thanks! -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
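A hedged sketch of loading the map so multi-word names match: one add() call per phrase, with the phrase split into its component words (Solr 1.4 SynonymMap; the city list is hypothetical, and the includeOrig/mergeExisting flags are as in the original post):

    // Each phrase becomes a list of its words; the filter then matches the
    // token sequence "New" "York" as a unit and injects the concept token.
    Token concept = new Token("CONCEPTcity", 0, 0, "SYNONYM");
    for (String city : Arrays.asList("New York", "Los Angeles", "Chicago")) {
      List<String> words = Arrays.asList(city.split(" "));
      synonymMap.add(words, Collections.singletonList(concept), true, true);
    }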
Re: Migrating from Lucene 2.9.1 to Solr 1.4.0 - Performance issues under heavy load
Is this an apples to apples comparison? That is, are you measuring the same complete flow on both apps? Does the Lucene app return fields via HTTP? On Tue, Aug 3, 2010 at 11:28 AM, Ophir Adiv firt...@gmail.com wrote: Hi, I’m currently involved in a project of migrating from Lucene 2.9.1 to Solr 1.4.0. During stress testing, I encountered this performance problem: While actual search times in our shards (which are now running Solr) have not changed, the total time it takes for a query has increased dramatically. During this performance test, we of course do not modify the indexes. Our application is sending Solr select queries concurrently to the 8 shards, using CommonsHttpSolrServer. I added some timing debug messages, and found that CommonsHttpSolrServer.java, line 416 takes about 95% of the application’s total search time: int statusCode = _httpClient.executeMethod(method); Just to clarify: looking at access logs of the Solr shards, TTLB for a query might be around 5 ms. (on all shards), but httpClient.executeMethod() for this query can be much higher – say, 50 ms. On average, if under light load queries take 12 ms., under heavy load they take around 22 ms. Another route we tried to pursue is adding the “shards=shard1,shard2,…” parameter to the query instead of doing this ourselves, but this doesn’t seem to work due to an NPE caused by QueryComponent.returnFields(), line 553: if (returnScores && sdoc.score != null) { where sdoc is null. I saw there is a null check on trunk, but since we’re currently using Solr 1.4.0’s ready-made WAR file, I didn’t see an easy way around this. Note: we’re using a custom query component which extends QueryComponent, but debugging this, I saw nothing wrong with the results at this point in the code. Our previous code used HTTP in a different manner: For each request, we created a new sun.net.www.protocol.http.HttpURLConnection, and called its getInputStream() method. Under the same load as the new application, the old application does not encounter the delays mentioned above. Our current code is initializing CommonsHttpSolrServer for each shard this way:

    MultiThreadedHttpConnectionManager httpConnectionManager = new MultiThreadedHttpConnectionManager();
    httpConnectionManager.getParams().setTcpNoDelay(true);
    httpConnectionManager.getParams().setMaxTotalConnections(1024);
    httpConnectionManager.getParams().setStaleCheckingEnabled(false);
    HttpClient httpClient = new HttpClient();
    HttpClientParams params = new HttpClientParams();
    params.setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
    params.setAuthenticationPreemptive(false);
    params.setContentCharset(StringConstants.UTF8);
    httpClient.setParams(params);
    httpClient.setHttpConnectionManager(httpConnectionManager);

and passing the new HttpClient to the Solr Server: solrServer = new CommonsHttpSolrServer(coreUrl, httpClient); We tried two different ways – one with a single MultiThreadedHttpConnectionManager and HttpClient for all the SolrServer’s, and the other with a new MultiThreadedHttpConnectionManager and HttpClient for each SolrServer. Both tries yielded similar performance results. Also tried to give setMaxTotalConnections() a much higher connections number (1,000,000) – didn’t have an effect. Would love to hear what you think about this. TIA, Ophir -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Rank results only on some fields
Can't this use case be done with a function query? On Sat, Jul 31, 2010 at 1:59 AM, Uwe Schindler u...@thetaphi.de wrote: Here some example code, the method is getFieldQuery() (Lucene 2.9 or 3.0 or following, don't use that approach before, because QueryWrapperFilter is not effective before 2.9 for that): @Override protected Query getFieldQuery(String field, String queryText) throws ParseException { Query q = super.getFieldQuery(field,queryText); if (!TITLE.equals(field)) q = new ConstantScoreQuery(new QueryWrapperFilter(q)); return q; } I hope that explains itself. You may look at other Query type factories in QP that produce scoring queries and wrap them similar. But e.g. WildCard and RangeQueries are constant score. Phrases are also handled by this method. Only the slop setting may not work correctly after this (look at the instanceof checks in getFieldQuery(..., slop)). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Saturday, July 31, 2010 10:19 AM To: java-user@lucene.apache.org Subject: RE: Rank results only on some fields You can construct the query using a customized query parser that wraps all queries not with the suggested field name using a new ConstantScoreQuery(new QueryWrapperFilter(originalCreatedQuery)). Override newFieldQuery() to do that and pass the super call to this ctor chain. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Philippe [mailto:mailer.tho...@gmail.com] Sent: Saturday, July 31, 2010 10:04 AM To: java-user@lucene.apache.org Subject: Rank results only on some fields Hi, I want to rank my results only on parts of my query. E.g my query is TITLE:Lucene AND AUTHOR:Manning. After this query standard lucene ranking for both fields take place. However, is it possible to query the index using the full query and rank results only according to the TITLE-Field? Regards, Philippe - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Rank results only on some fields
Oops, didn't notice that this was java-user. I had a Solr 'why write more code' reaction ?) On Sat, Jul 31, 2010 at 1:56 PM, Uwe Schindler u...@thetaphi.de wrote: We don't want to modify the ranking using functions, we want to switch some queries to constant score mode. The QueryParser subclassing is just to make it convenient. In general to strip off scores from queries, you use new ConstantScoreQuery(new QueryWrapperFilter(query)), this is used inside Lucene, too (MultiTermQuery,...). The trick is to normalize the Scorer to return a constant value (boost of CSQ). This can be done by first wrapping the scorer of the original query in a filter and then add a scorer to the filter again, that returns a constant. With function queries you can do something similar by returning a constant in the CustomScoreProvider. The QWF/CSQ trick is more convenient and used quite often inside Lucene, too. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Saturday, July 31, 2010 10:50 PM To: java-user@lucene.apache.org Subject: Re: Rank results only on some fields Can't this use case be done with a function query? On Sat, Jul 31, 2010 at 1:59 AM, Uwe Schindler u...@thetaphi.de wrote: Here some example code, the method is getFieldQuery() (Lucene 2.9 or 3.0 or following, don't use that approach before, because QueryWrapperFilter is not effective before 2.9 for that): @Override protected Query getFieldQuery(String field, String queryText) throws ParseException { Query q = super.getFieldQuery(field,queryText); if (!TITLE.equals(field)) q = new ConstantScoreQuery(new QueryWrapperFilter(q)); return q; } I hope that explains itself. You may look at other Query type factories in QP that produce scoring queries and wrap them similar. But e.g. WildCard and RangeQueries are constant score. Phrases are also handled by this method. Only the slop setting may not work correctly after this (look at the instanceof checks in getFieldQuery(..., slop)). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Saturday, July 31, 2010 10:19 AM To: java-user@lucene.apache.org Subject: RE: Rank results only on some fields You can construct the query using a customized query parser that wraps all queries not with the suggested field name using a new ConstantScoreQuery(new QueryWrapperFilter(originalCreatedQuery)). Override newFieldQuery() to do that and pass the super call to this ctor chain. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Philippe [mailto:mailer.tho...@gmail.com] Sent: Saturday, July 31, 2010 10:04 AM To: java-user@lucene.apache.org Subject: Rank results only on some fields Hi, I want to rank my results only on parts of my query. E.g my query is TITLE:Lucene AND AUTHOR:Manning. After this query standard lucene ranking for both fields take place. However, is it possible to query the index using the full query and rank results only according to the TITLE-Field? 
Regards, Philippe -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Best practices for searcher memory usage?
Glen, thank you for this very thorough and informative post. Lance Norskog - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: segment_N file is missed
That could be; Maryam, is that what happened? Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: phrase search in a particular case
SpanFirstQuery is the clean option. Another option is to add a start token to each title. Then, search for the phrase "startToken oil spill". This will be faster than SpanFirstQuery. But it also requires doing something weird to the field. Lance On Thu, Jun 17, 2010 at 3:19 PM, Michael McCandless luc...@mikemccandless.com wrote: SpanFirstQuery? Mike On Thu, Jun 17, 2010 at 3:23 PM, rakesh rakesh rakeshiit.2...@gmail.com wrote: Hi, I have thousands of article titles in a lucene index. So for a query "Oil spill" I want to return all the article titles that start with "Oil spill". I do not want those titles which have this phrase but do not start with it. Can anyone help me? Thanks in advance, rakesh -- Lance Norskog goks...@gmail.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
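A minimal sketch of the SpanFirstQuery option (Lucene 3.x; the field name is hypothetical). The phrase occupies positions 0-1, so requiring the span to end by position 2 restricts matches to titles that begin with it:

    SpanQuery oil = new SpanTermQuery(new Term("title", "oil"));
    SpanQuery spill = new SpanTermQuery(new Term("title", "spill"));
    SpanQuery phrase = new SpanNearQuery(new SpanQuery[] { oil, spill }, 0, true); // exact, in order
    Query startsWith = new SpanFirstQuery(phrase, 2); // span must end at position <= 2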
Re: A question about google search index?
http://research.google.com/pubs/DistributedSystemsandParallelComputing.html

On Thu, Jun 10, 2010 at 1:51 AM, Yuval Feinstein yuv...@answers.com wrote:

Most of the implementation of Google's search index is kept secret by Google. Based on publicly available information, the indexes are quite different: Google uses its BigTable and MapReduce technologies to distribute the index efficiently. There are similar efforts in the Lucene ecosystem; Solr Cloud, which is currently in development, is an advanced one. As Google's scoring algorithm uses hundreds of signals, I guess they store data pertinent to these signals in the index. Lucene's index holds relatively few pieces of information about every document (posting lists, term vectors, sometimes norms and payloads). I believe there are other differences as well, but one can only guess what they are... Cheers, Yuval

-Original Message-
From: luocanrao [mailto:luocan19826...@sohu.com]
Sent: Wednesday, June 09, 2010 5:18 PM
To: java-user@lucene.apache.org
Subject: A question about google search index?

A news item about Google's search index. Lucene's index system can also support realtime search; is there some difference between them?

"With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before, no matter when or where it was published. Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If this were a pile of paper it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles."
Re: segment_N file is missing
The CheckIndex class/program will recreate the segments file when it removes a broken segment from an index. That's the only source I've found for how these files are made. If you are able to hack this up, a CFSDirectory would be a wonderful addition to the Lucene Directory suite. A CFS file is a complete Lucene index, and it is much, much easier to deploy single files than file sets.

On Wed, Jun 9, 2010 at 6:33 AM, maryam ma'danipour m.madanip...@gmail.com wrote:

Hello to all! I have the _0.cfs file of a Lucene index directory, but segments.gen and segments_2 are missing. Can I generate the segments.gen and segments_2 files without having to regenerate the _0.cfs file? Do these segments files contain any index-specific data, which would thus force me to regenerate the entire index? Or can I just create the two segments files by copying them from another Lucene index directory generated with the same Lucene version, or can I merge this index with another index which has segments_N to retrieve the data? Thanks
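For what it's worth, here is a minimal sketch of driving CheckIndex programmatically (Lucene 3.x-era API; the index path is illustrative). One caveat for Maryam's exact case: CheckIndex needs some segments_N file to open the index at all, so with the segments files completely missing you would have to adapt its internals rather than just run it:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class FixIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
            CheckIndex checker = new CheckIndex(dir);
            CheckIndex.Status status = checker.checkIndex();
            if (!status.clean) {
                // Drops unreadable segments and writes a fresh segments_N
                // referencing only the healthy ones. Data in the dropped
                // segments is lost, so back up the index first.
                checker.fixIndex(status);
            }
            dir.close();
        }
    }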
Re: Solr tutorial
Use solr-user@ instead of java-user@; you'll find more knowledgeable people there.

On Mon, May 31, 2010 at 6:36 PM, N Hira nh...@cognocys.com wrote:

I don't know of a single tutorial that puts it all together, but the rich-documents feature implemented in SOLR-284 is where I would start: https://issues.apache.org/jira/browse/SOLR-284 Look here if you're using Solr 1.4; it should address your needs: http://wiki.apache.org/solr/ExtractingRequestHandler Good luck, -h

----- Original Message -----
From: s...@icarinae.com s...@icarinae.com
To: java-user@lucene.apache.org
Sent: Mon, May 31, 2010 8:17:02 PM
Subject: Solr tutorial

Hi, I am struggling to set up Solr to search PDF files. I am following documents from Lucid Imagination and the wiki. Can someone please point me to a good Solr tutorial with step-by-step instructions for indexing and searching PDF documents, with highlighting and snippets? Thanks in advance, Deepak
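A sketch of what using the ExtractingRequestHandler mentioned above looks like from SolrJ (Solr 1.4-era API; the server URL, file name, and literal.id value are assumptions for illustration):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class IndexPdf {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // /update/extract is the default mapping for ExtractingRequestHandler.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("manual.pdf"));
            req.setParam("literal.id", "doc1");  // supply the unique key yourself
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            server.request(req);
        }
    }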
Re: Right memory for search application
Solr's timestamp representation (TrieDateField) is tuned for space and speed. It has a compressed representation and sorts with far less space than Strings. You also get date faceting, which lets you bucket facet searches by time block. (A NumericField sketch for the Lucene side is below.)

On Tue, Apr 27, 2010 at 1:02 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

Samarendra Pratap [samarz...@gmail.com] wrote:
1. Our default option is sort by score; however, almost 8% of searches sort on a field (yyyyMMddHHmmss). This field is indexed as a string (not as a NumericField or DateField).

Guessing that the timestamp is practically unique for each document, sorting by String takes up a bit more than 18M * (40 bytes + 2 * "yyyyMMddHHmmss".length() bytes) ~= 1.2 GB of RAM, as the Strings are cached. Coupled with the normal overhead of just opening an index of your size (500 MB by your measurements?), I would have guessed that 3600 MB would definitely be enough to open the index and do sorted searches.

I realize that fiddling with production servers is dangerous, but connecting with JConsole and forcing a garbage collection might be acceptable? That should enable you to determine whether you're leaking memory or whether it's just the JVM being greedy. I'd guess you're leaking, though, as HotSpot does not normally allocate up to the limit if it does not need to.

Anyway, changing to one of the fields optimized for date sorting should shave 1 GB off the memory requirement, so I recommend doing that no matter what the main cause of your memory problems is. Regards, Toke Eskildsen
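On the plain-Lucene side, the equivalent of moving off String sorting is to index the timestamp as a NumericField and sort on it as a long, so the field cache holds 8 bytes per document instead of a cached String. A minimal sketch, assuming Lucene 3.x and an illustrative field name:

    import java.text.SimpleDateFormat;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    public class TimestampField {
        public static void main(String[] args) throws Exception {
            long millis = new SimpleDateFormat("yyyyMMddHHmmss")
                    .parse("20100427130200").getTime();
            Document doc = new Document();
            // Indexed for sorting and range queries, not stored.
            doc.add(new NumericField("timestamp", 8, Field.Store.NO, true)
                    .setLongValue(millis));
            // At search time, sort numerically instead of by String,
            // newest first:
            Sort byTime = new Sort(new SortField("timestamp", SortField.LONG, true));
            System.out.println(byTime);
        }
    }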
Utility program to extract a segment
Is there a program available that makes a new index from one or more segments of an existing index? (The immediate use case for this is doing forensics on corrupted indexes.) The user interface would be:

    extract -segments _ab,_g9 oldindex newindex

This would copy the files for segments _ab and _g9 into a new directory and generate a segments.gen for just those two segments. Is this all that's needed? A sketch of the file-copy half is below. Lance Norskog
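A rough sketch of the file-copy half under those assumptions (the hard part, writing a valid segments_N/segments.gen for the copied segments, is exactly the open question above and is not attempted here):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class ExtractSegments {
        public static void main(String[] args) throws IOException {
            String[] segments = args[0].split(",");  // e.g. "_ab,_g9"
            File oldIndex = new File(args[1]);
            File newIndex = new File(args[2]);
            newIndex.mkdirs();
            File[] files = oldIndex.listFiles();
            if (files == null) throw new IOException("not a directory: " + oldIndex);
            for (File f : files) {
                String name = f.getName();
                for (String seg : segments) {
                    // Match _ab.cfs, _ab.fdt, ... but not _abc.*
                    if (name.startsWith(seg + ".")) {
                        copy(f, new File(newIndex, name));
                    }
                }
            }
        }

        private static void copy(File src, File dst) throws IOException {
            FileInputStream in = new FileInputStream(src);
            FileOutputStream out = new FileOutputStream(dst);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            in.close();
            out.close();
        }
    }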
[JOB] Solr/Lucene developer wanted in startup: San Francisco Peninsula, CA, USA
Hi- We are a startup in the web-indexing space, looking for an experienced search-engine developer for our team. We are using Solr and Lucene, but search is search, and solid experience with any large search engine is welcome. At this point we do not wish to disclose our name. We have solid funding and are a real business, with a contract with a large company to lease our index and provide various services. Thanks for your time. Please contact me at [EMAIL PROTECTED] Lance Norskog 650-922-8831
UTF-8/unicode input in querying in Lucene
Hi- The page http://lucene.apache.org/java/docs/queryparsersyntax.html does not mention that the \uXXXX Unicode escape syntax is supported. For example, \u0048\u0045\u004c\u004c\u004f is HELLO. Please add this to the page; it took experimentation to discover it. Thanks, Lance Norskog
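A small demo of the behavior, assuming the Lucene 3.0-era classic QueryParser (the field name and analyzer are arbitrary choices for illustration):

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class UnicodeEscapeDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser(Version.LUCENE_30, "body",
                    new KeywordAnalyzer());
            // Note the doubled backslashes: Java consumes one level of
            // escaping, and the query parser consumes the other.
            Query escaped = parser.parse("\\u0048\\u0045\\u004c\\u004c\\u004f");
            Query plain = parser.parse("HELLO");
            // Both print as body:HELLO; the parser resolves the escapes.
            System.out.println(escaped + " == " + plain);
        }
    }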