IndexableBinaryStringTools (was FieldCache)
Hi,

> > [...] I tried to use IndexableBinaryStringTools to re-encode my 11 byte
> > array. The size was increased to 7 characters (= 14 bytes), which is
> > still a gain of more than 50 percent compared to the UTF-8 encoding.
> > BTW: I found no sample showing how to use the IndexableBinaryStringTools
> > class except in the unit tests.
>
> IndexableBinaryStringTools will eventually be deprecated and then dropped,
> in favor of native indexable/searchable binary terms. More work is
> required before these are possible, though.
>
> Well-maintained unit tests are not a bad way to describe functionality...

Sure, but there is no unit test for Solr.

> > I assume that the char[] returned from IndexableBinaryStringTools.encode
> > is encoded in UTF-8 again and then stored. At some point the information
> > is lost and cannot be recovered.
>
> Can you give an example? This should not happen.

It's hard to give example output, because the binary string representation
contains unprintable characters. I'll try to explain what I'm doing.

My character array returned by IndexableBinaryStringTools.encode looks like
the following:

    char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};

Then I add it to a SolrInputDocument:

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", new String(encoded));

If I now print the SolrInputDocument using System.out.println(doc), the
String representation of the character array is correct. Then I add it to
the server (backed by a RAMDirectory):

    ArrayList<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    docs.add(doc);
    solrServer.add(docs);
    solrServer.commit();

... and immediately retrieve it as follows:

    SolrQuery query = new SolrQuery();
    query.setQuery("*:*");
    QueryResponse rsp = solrServer.query(query);
    SolrDocumentList docList = rsp.getResults();
    System.out.println(docList);

Now the string representation of the SolrDocument's ID looks different from
that of the SolrInputDocument.
If I do not create a new String in doc.addField, just the string
representation of the array's address will be added to the
SolrInputDocument.

BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.

Why has the string representation changed? From the changed string I cannot
decode the correct ID.

--
Kind regards,
Mathias
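[Editor's note] The idea behind IndexableBinaryStringTools is to pack raw
bytes into chars at fewer than 16 bits per char, so the result survives
Lucene's term handling. The sketch below illustrates that packing scheme in
plain Java; it is NOT the Lucene implementation (the real class uses its own
convention for recording the trailing-bit count; the length-prefix char here
is an assumption purely for this sketch). Note that for an 11-byte input it
also yields 7 chars, matching the size reported in the thread:

```java
// Sketch of 15-bits-per-char binary-to-char packing, loosely modeled on the
// idea behind Lucene's IndexableBinaryStringTools. Every output char stays
// below U+8000. The length-prefix char (out[0]) is a sketch-only convention.
public class Binary15BitCodecSketch {

    static char[] encode(byte[] in) {
        int payloadChars = (in.length * 8 + 14) / 15;   // ceil(bits / 15)
        char[] out = new char[payloadChars + 1];
        out[0] = (char) in.length;                      // sketch-only length prefix
        long buf = 0;
        int nBits = 0, pos = 1;
        for (byte b : in) {
            buf = (buf << 8) | (b & 0xFF);
            nBits += 8;
            while (nBits >= 15) {                       // emit full 15-bit chunks
                out[pos++] = (char) ((buf >>> (nBits - 15)) & 0x7FFF);
                nBits -= 15;
            }
        }
        if (nBits > 0) {                                // left-align leftover bits
            out[pos] = (char) ((buf << (15 - nBits)) & 0x7FFF);
        }
        return out;
    }

    static byte[] decode(char[] in) {
        byte[] out = new byte[in[0]];                   // read sketch length prefix
        long buf = 0;
        int nBits = 0, pos = 0;
        for (int i = 1; i < in.length && pos < out.length; i++) {
            buf = (buf << 15) | (in[i] & 0x7FFF);
            nBits += 15;
            while (nBits >= 8 && pos < out.length) {    // re-emit original bytes
                out[pos++] = (byte) ((buf >>> (nBits - 8)) & 0xFF);
                nBits -= 8;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] id = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};   // an 11-byte ID
        char[] encoded = encode(id);
        System.out.println("encoded length in chars: " + encoded.length);   // 7
        System.out.println("round trip ok: "
                + java.util.Arrays.equals(decode(encoded), id));            // true
    }
}
```

The lossless round trip only holds as long as the chars are never pushed
through a lossy transcoding step, which is exactly the failure mode being
debugged in this thread.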
RE: FieldCache
Hi,

> On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter wrote:
> > I indexed about 90 million sentences and the PAS (predicate argument
> > structures) they consist of (which are about 500 million). Then I try
> > to do NER (named entity recognition) by searching about 5 million
> > entities. For each entity I need all search results, not just the top
> > X. Since about 10 percent of the entities are high-frequency (i.e.
> > there are more than 5 million hits for "human"), it takes very long to
> > obtain the data from the index. "Very long" means about a day with 15
> > distributed Katta nodes. Katta is just a distribution and shard
> > balancing solution on top of Lucene.
>
> if you aren't getting top-N results/doing search, are you sure a
> search engine library/server is the right tool for this job?

No, I'm not sure, but I didn't find another solution. Any other solution
also has to create some kind of index and has to provide some search API.
Because I need SpanNearQuery and PhraseQuery to find some multi-term
entities, I think Solr/Lucene is a good starting point. Also, I need the
classic top-N results for the web application. So a single solution is
preferred.

> > Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte
> > array. The size was increased to 7 characters (= 14 bytes), which is
> > still a gain of more than 50 percent compared to the UTF-8 encoding.
> > BTW: I found no sample showing how to use the
> > IndexableBinaryStringTools class except in the unit tests.
>
> it is deprecated in trunk, because you can index binary terms (your
> own byte[]) directly if you want. To do this, you need to use a custom
> AttributeFactory.

How do I use it with Solr, i.e. how do I set up a schema.xml using a custom
AttributeFactory?

--
Kind regards,
Mathias
RE: FieldCache
I don't think it is an XY problem.

I indexed about 90 million sentences and the PAS (predicate argument
structures) they consist of (which are about 500 million). Then I try to do
NER (named entity recognition) by searching about 5 million entities. For
each entity I need all search results, not just the top X. Since about 10
percent of the entities are high-frequency (i.e. there are more than 5
million hits for "human"), it takes very long to obtain the data from the
index. "Very long" means about a day with 15 distributed Katta nodes. Katta
is just a distribution and shard balancing solution on top of Lucene.

Initially, I tried distributed search with Solr. But it was too slow to
retrieve a large set of documents. Then I switched to Lucene and made some
improvements. I enabled the field cache for my ID field and another
single-char field (PAS type) to get the benefit of accessing the fields
with an array. Unfortunately, the IDs are too large to fit in memory. I
gave 12 GB of RAM to each node and also tried to use the MMapDirectory
and/or CompressedOops. Lucene always runs out of memory.

Then I investigated the storage of the fields. String fields are stored in
UTF-8 encoding. But my ID will never contain non-ASCII characters. It
follows a numeric schema but does not fit into a single long. I encoded it
into a byte array of 11 bytes (compared to 30 bytes in the UTF-8 encoding).
Then I changed the field definition in schema.xml to binary. I still use
the EmbeddedSolrServer to create the indices. Also, I had to remove the
uniqueKey node, because binary fields cannot be indexed, which is a
requirement for the unique key. After reindexing I discovered that
non-indexed or binary fields cannot be used with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte
array. The size was increased to 7 characters (= 14 bytes), which is still
a gain of more than 50 percent compared to the UTF-8 encoding.
BTW: I found no sample showing how to use the IndexableBinaryStringTools
class except in the unit tests.

Unfortunately, I was not able to use it with the EmbeddedSolrServer and the
Lucene client. The search result never looked identical to the IDs used to
create the SolrInputDocument. I assume that the char[] returned from
IndexableBinaryStringTools.encode is encoded in UTF-8 again and then
stored. At some point the information is lost and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from
FieldCache.DEFAULT.getTerms directly. But the bytes are encoded in a form
unknown to me and cannot be decoded with IndexableBinaryStringTools.decode.

The question is now: how can I increase the performance of binary field
retrieval without exploding memory? I also read some comments which suggest
using payloads, but I never tried this approach. Also, the column-stride
fields approach (LUCENE-2186) looks promising but is not released yet.

BTW: I made some tests with a smaller index and the ID encoded as a string.
Using the field cache improves hit retrieval dramatically (from 18 seconds
down to 2 seconds per query, with a large number of results).

--
Kind regards,
Mathias

> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, October 23, 2010 9:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FieldCache
>
> Why do you want to? Basically, the caches are there to improve
> #searching#. To search something, you must index it. Retrieving
> it is usually a rare enough operation that caching is irrelevant.
>
> This smells like an XY problem, see:
> http://people.apache.org/~hossman/#xyproblem
>
> If this seems like gibberish, could you explain your problem
> a little more?
>
> Best
> Erick
>
> On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter wrote:
>
> > Hi,
> >
> > does a field which should be cached need to be indexed?
> >
> > I have a binary field which is just stored.
> > Retrieving it via FieldCache.DEFAULT.getTerms returns empty BytesRefs.
> >
> > Then I found the following post:
> > http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html
> >
> > How can I use the FieldCache with a binary field?
> >
> > --
> > Kind regards,
> > Mathias
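[Editor's note] The message above mentions an ID that "follows a numeric
schema but does not fit into a single long", packed into 11 bytes. The
actual ID layout is not given in the thread; the sketch below only
illustrates one hypothetical way such a two-part numeric ID could be packed
into exactly 11 bytes (assuming one 64-bit part and one part below 2^24 --
both the split and the names are assumptions):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: pack a two-part numeric ID into 11 bytes.
// Assumes 'major' needs up to 64 bits and 'minor' fits in 24 bits.
// The 8+3 layout is illustrative only; the real ID layout is unknown.
public class IdPackingSketch {

    static byte[] pack(long major, int minor) {      // 0 <= minor < 2^24
        ByteBuffer buf = ByteBuffer.allocate(11);
        buf.putLong(major);                          // bytes 0-7: big-endian long
        buf.put((byte) (minor >>> 16));              // bytes 8-10: 24-bit minor
        buf.put((byte) (minor >>> 8));
        buf.put((byte) minor);
        return buf.array();
    }

    static long unpackMajor(byte[] packed) {
        return ByteBuffer.wrap(packed).getLong();
    }

    static int unpackMinor(byte[] packed) {
        return ((packed[8] & 0xFF) << 16)
                | ((packed[9] & 0xFF) << 8)
                | (packed[10] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] packed = pack(90000000L, 4711);
        System.out.println(packed.length);           // 11
        System.out.println(unpackMajor(packed));     // 90000000
        System.out.println(unpackMinor(packed));     // 4711
    }
}
```

A fixed-width big-endian layout like this also keeps the byte-wise sort
order consistent with the numeric order of (major, minor), which matters if
the packed bytes are ever used as indexed terms.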
FieldCache
Hi,

does a field which should be cached need to be indexed?

I have a binary field which is just stored. Retrieving it via
FieldCache.DEFAULT.getTerms returns empty BytesRefs.

Then I found the following post:
http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html

How can I use the FieldCache with a binary field?

--
Kind regards,
Mathias
RE: Using Solr Analyzers in Lucene
Hi Max,

why don't you use WordDelimiterFilterFactory directly? I'm doing the same
stuff inside my own analyzer:

    final Map<String, String> args = new HashMap<String, String>();
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "0");
    args.put("catenateNumbers", "0");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");
    args.put("splitOnNumerics", "1");
    args.put("preserveOriginal", "1");
    args.put("stemEnglishPossessive", "0");
    args.put("language", "English");

    wordDelimiter = new WordDelimiterFilterFactory();
    wordDelimiter.init(args);
    stream = wordDelimiter.create(stream);

--
Kind regards,
Mathias

> -----Original Message-----
> From: Max Lynch [mailto:ihas...@gmail.com]
> Sent: Tuesday, October 05, 2010 1:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solr Analyzers in Lucene
>
> I have made progress on this by writing my own Analyzer. I basically
> added the TokenFilters that are under each of the Solr factory classes.
> I had to copy and paste the WordDelimiterFilter because, of course, it
> was package protected.
>
> On Mon, Oct 4, 2010 at 3:05 PM, Max Lynch wrote:
>
> > Hi,
> > I asked this question a month ago on lucene-user and was referred here.
> >
> > I have content being analyzed in Solr using these tokenizers and
> > filters:
> >
> > positionIncrementGap="100">
> > generateWordParts="0" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > protected="protwords.txt"/>
> > generateWordParts="0" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > protected="protwords.txt"/>
> >
> > Basically I want to be able to search against this index in Lucene
> > with one of my background searching applications.
> >
> > My main reason for using Lucene over Solr for this is that I use the
> > highlighter to keep track of exactly which terms were found, which I
> > use for my own scoring system, and I always collect the whole set of
> > found documents. I've messed around with using boosts, but it wasn't
> > fine-grained enough and I wasn't able to effectively create a score
> > threshold (would creating my own scorer be a better idea?).
> >
> > Is it possible to use this analyzer from Lucene, or at least re-create
> > it in code?
> >
> > Thanks.
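[Editor's note] The schema fragments quoted above lost their XML element
names in transit; only some attribute values survived. Under the assumption
that the field type followed the stock Solr example schema, it might have
looked roughly like the following. The element and class names here are a
reconstruction, not from the original mail; only the attribute values
(positionIncrementGap, the WordDelimiterFilterFactory flags, and
protected="protwords.txt") appear in the quoted text:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <!-- the original mail quoted a second, identical-looking chain here -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```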
RE: WordDelimiterFilter combined with PositionFilter
Hi Robert,

> On Fri, Sep 24, 2010 at 3:54 AM, Mathias Walter wrote:
>
> > Hi,
> >
> > I combined the WordDelimiterFilter with the PositionFilter to prevent
> > the creation of expensive Phrase and MultiPhraseQueries. But if I now
> > parse an escaped string consisting of two terms, the analyser returns
> > a BooleanQuery. That's not what I would expect. If a string is
> > escaped, I would expect a PhraseQuery and not a BooleanQuery.
> >
> > What should be the correct behavior?
>
> instead of PositionFilter, you can upgrade to either trunk or branch_3x
> from svn, and use:
>
> autoGeneratePhraseQueries="false">
>
> then you will get phrase queries when the user asked for them, but not
> automatically.

Are term vector positions still correctly computed if positionIncrementGap
is used?

--
Kind regards,
Mathias
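[Editor's note] The quoted suggestion lost its surrounding XML; the
autoGeneratePhraseQueries="false" attribute goes on the fieldType element in
schema.xml (branch_3x/trunk at the time of this thread). A minimal sketch --
the field type name and analyzer chain below are placeholder assumptions,
only the attribute itself is from the mail:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
```

With this set, multi-token output from a single whitespace-separated query
term is combined with BooleanQuery semantics unless the user explicitly
quotes a phrase.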
WordDelimiterFilter combined with PositionFilter
Hi,

I combined the WordDelimiterFilter with the PositionFilter to prevent the
creation of expensive Phrase and MultiPhraseQueries. But if I now parse an
escaped string consisting of two terms, the analyser returns a
BooleanQuery. That's not what I would expect. If a string is escaped, I
would expect a PhraseQuery and not a BooleanQuery.

What should be the correct behavior?

--
Kind regards,
Mathias