IndexableBinaryStringTools (was FieldCache)

2010-11-02 Thread Mathias Walter
Hi,

> > [...] I tried to use IndexableBinaryStringTools to re-encode my 11-byte
> > array. The size was increased to 7 characters (= 14 bytes), which is still
> > a saving of more than 50 percent compared to the UTF-8 encoding. BTW: I
> > found no sample of how to use the IndexableBinaryStringTools class except
> > in the unit tests.
> 
> IndexableBinaryStringTools will eventually be deprecated and then dropped, in 
> favor of native
> indexable/searchable binary terms.  More work is required before these are 
> possible, though.
> 
> Well-maintained unit tests are not a bad way to describe functionality...

Sure, but there is no unit test for Solr.

> > I assume that the char[] returned from IndexableBinaryStringTools.encode
> > is encoded in UTF-8 again and then stored. At some point
> > the information is lost and cannot be recovered.
> 
> Can you give an example?  This should not happen.

It's hard to give example output, because the binary string representation
contains unprintable characters. I'll try to explain what I'm doing.

My character array returned by IndexableBinaryStringTools.encode looks like 
following:

char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};

Then I add it to a SolrInputDocument:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", new String(encoded));

If I now print the SolrInputDocument using System.out.println(doc), the String 
representation of the character array is correct.

Then I add it to the index (a RAMDirectory):

ArrayList<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc);
solrServer.add(docs);
solrServer.commit();

... and immediately retrieve it like follows:

SolrQuery query = new SolrQuery();
query.setQuery("*:*");
QueryResponse rsp = solrServer.query(query);
SolrDocumentList docList = rsp.getResults();
System.out.println(docList);

Now the string representation of the SolrDocument's ID looks different from
that of the SolrInputDocument.

If I do not wrap the char[] in a new String in doc.addField, just the default
toString of the array (its object address) is added to the SolrInputDocument.

BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.

Why has the string representation changed? From the changed string I cannot 
decode the correct ID.
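
To rule out the encoding itself, here is a minimal round-trip check I would
expect to pass (a sketch only: it uses the ByteBuffer-based helpers of
IndexableBinaryStringTools, the exact overloads differ between versions, and
the ID bytes below are made up):

// needs java.nio.ByteBuffer/CharBuffer, java.util.Arrays and
// org.apache.lucene.util.IndexableBinaryStringTools
byte[] id = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 }; // made-up 11-byte ID

// encode once, exactly as done before doc.addField("id", ...)
CharBuffer enc = IndexableBinaryStringTools.encode(ByteBuffer.wrap(id));
String stored = new String(enc.array(), 0, enc.limit());

// simulate what should come back in the SolrDocument and decode it again
ByteBuffer dec = IndexableBinaryStringTools.decode(CharBuffer.wrap(stored.toCharArray()));
byte[] roundTripped = new byte[dec.limit()];
dec.get(roundTripped);

System.out.println(Arrays.equals(id, roundTripped)); // should print true

If this prints true, the information must get lost somewhere between
solrServer.add() and the query response.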

--
Kind regards,
Mathias



RE: FieldCache

2010-10-25 Thread Mathias Walter
Hi,

> On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter wrote:
> > I indexed about 90 million sentences and the PAS (predicate argument
> > structures) they consist of (which are about 500 million). Then I try to
> > do NER (named entity recognition) by searching about 5 million entities.
> > For each entity I need all the search results, not just the top X. Since
> > about 10 percent of the entities are highly frequent (i.e. there are more
> > than 5 million hits for "human"), it takes very long to obtain the data
> > from the index. "Very long" means about a day with 15 distributed Katta
> > nodes. Katta is just a distribution and shard balancing solution on top
> > of Lucene.
> 
> if you aren't getting top-N results/doing search, are you sure a
> search engine library/server is the right tool for this job?

No, I'm not sure, but I didn't find another solution. Any other solution would
also have to create some kind of index and provide some search API. Because I
need SpanNearQuery and PhraseQuery to find multi-term entities, I think
Solr/Lucene is a good starting point. Also, I need the classic top-N results
for the web application, so a single solution is preferred.
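
Just to illustrate the kind of queries I mean for the multi-term entities (the
field name and terms below are made up):

// e.g. match "tumor necrosis factor" as a phrase or within a small window
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("sentence", "tumor"));
phrase.add(new Term("sentence", "necrosis"));
phrase.add(new Term("sentence", "factor"));

SpanNearQuery near = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("sentence", "tumor")),
    new SpanTermQuery(new Term("sentence", "necrosis")),
    new SpanTermQuery(new Term("sentence", "factor")) },
    1, true); // slop 1, terms in order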

> > Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte
> > array. The size was increased to 7 characters (= 14 bytes), which is still
> > a saving of more than 50 percent compared to the UTF-8 encoding. BTW: I
> > found no sample of how to use the IndexableBinaryStringTools class except
> > in the unit tests.
> 
> it is deprecated in trunk, because you can index binary terms (your
> own byte[]) directly if you want. To do this, you need to use a custom
> AttributeFactory.

How do I use that with Solr, i.e. how do I set up schema.xml to use a custom
AttributeFactory?

--
Kind regards,
Mathias



AW: FieldCache

2010-10-25 Thread Mathias Walter
I don't think it is an XY problem.

I indexed about 90 million sentences and the PAS (predicate argument
structures) they consist of (which are about 500 million). Then I try to do
NER (named entity recognition) by searching about 5 million entities. For each
entity I need all the search results, not just the top X. Since about 10
percent of the entities are highly frequent (i.e. there are more than 5 million
hits for "human"), it takes very long to obtain the data from the index. "Very
long" means about a day with 15 distributed Katta nodes. Katta is just a
distribution and shard balancing solution on top of Lucene.

Initially, I tried distributed search with Solr, but it was too slow to
retrieve a large set of documents. Then I switched to Lucene and made some
improvements. I enabled the field cache for my ID field and another
single-character field (the PAS type) to get the benefit of accessing the
fields through an array. Unfortunately, the IDs are too large to fit in
memory. I gave 12 GB of RAM to each node and also tried MMapDirectory and/or
CompressedOops. Lucene always runs out of memory.
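
For reference, "accessing the fields through an array" means the per-document
lookup the 3.x FieldCache gives you; roughly like this (the "id" field is from
my schema, "pasType" is just what I call the other field here; on trunk,
getTerms returns BytesRef-based structures instead):

// org.apache.lucene.search.FieldCache; reader is the IndexReader of one node
String[] ids = FieldCache.DEFAULT.getStrings(reader, "id");     // one entry per document
String[] pasTypes = FieldCache.DEFAULT.getStrings(reader, "pasType");
String id = ids[docId]; // O(1) lookup per hit instead of a stored-field read

This is also where the memory problem comes from: the whole array has to fit
on the heap.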

Then I investigated how the fields are stored. String fields are stored
UTF-8-encoded, but my ID will never contain characters that need UTF-8. It
follows a numeric scheme but does not fit into a single long, so I encoded it
into a byte array of 11 bytes (compared to 30 bytes for the UTF-8-encoded
string). Then I changed the field definition in schema.xml to binary. I still
use the EmbeddedSolrServer to create the indices.
Also, I had to remove the uniqueKey node, because binary fields cannot be
indexed, which is a requirement for the unique key.
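
The exact layout of my ID doesn't matter here, but to illustrate the kind of
packing I mean (the three components below are hypothetical):

// java.nio.ByteBuffer; the component values are made up
long documentNumber = 1234567L;
short sentenceNumber = (short) 42;
byte pasIndex = (byte) 3;
byte[] id = ByteBuffer.allocate(11)
    .putLong(documentNumber)  // 8 bytes
    .putShort(sentenceNumber) // 2 bytes
    .put(pasIndex)            // 1 byte -> 11 bytes instead of ~30 as a UTF-8 string
    .array();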

After reindexing I discovered that non-indexed or binary fields cannot be used
with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array.
The size was increased to 7 characters (= 14 bytes), which is still a saving
of more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample
of how to use the IndexableBinaryStringTools class except in the unit tests.
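
So, for the record, this is roughly how I call it (a sketch using the
ByteBuffer-based helpers; the exact overloads differ between versions), with
the id array from the sketch above:

int numChars = IndexableBinaryStringTools.getEncodedLength(ByteBuffer.wrap(id)); // 7 for my 11-byte ID
CharBuffer encoded = IndexableBinaryStringTools.encode(ByteBuffer.wrap(id));
String indexable = new String(encoded.array(), 0, encoded.limit()); // what goes into the field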

Unfortunately, I was not able to use it with the EmbeddedSolrServer and the
Lucene client: the search results never looked identical to the IDs I used to
create the SolrInputDocuments.

I assume that the char[] returned from IndexableBinaryStringTools.encode is
encoded in UTF-8 again and then stored. At some point the information is lost
and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from
FieldCache.DEFAULT.getTerms directly. But the bytes are encoded in a form
unknown to me and cannot be decoded with IndexableBinaryStringTools.decode.

The question now is: how can I speed up retrieval of the binary field without
blowing up the memory?

I also read some comments suggesting the use of payloads, but I have never
tried that approach. The column-stride fields approach (LUCENE-2186) also
looks promising, but it is not released yet.

BTW: I ran some tests with a smaller index and the ID encoded as a string.
Using the field cache improves hit retrieval dramatically (from 18 seconds
down to 2 seconds per query, with a large number of results).

--
Kind regards,
Mathias

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, October 23, 2010 9:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FieldCache
> 
> Why do you want to? Basically, the caches are there to improve
> #searching#. To search something, you must index it. Retrieving
> it is usually a rare enough operation that caching is irrelevant.
> 
> This smells like an XY problem, see:
> http://people.apache.org/~hossman/#xyproblem
> 
> If this seems like gibberish, could you explain your problem
> a little more?
> 
> Best
> Erick
> 
> On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter wrote:
> 
> > Hi,
> >
> > does a field which should be cached need to be indexed?
> >
> > I have a binary field which is just stored. Retrieving it via
> > FieldCache.DEFAULT.getTerms returns empty ByteRefs.
> >
> > Then I found the following post:
> > http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html
> >
> > How can I use the FieldCache with a binary field?
> >
> > --
> > Kind regards,
> > Mathias
> >
> >



FieldCache

2010-10-21 Thread Mathias Walter
Hi,

does a field which should be cached need to be indexed?

I have a binary field which is just stored. Retrieving it via 
FieldCache.DEFAULT.getTerms returns empty ByteRefs.

Then I found the following post: 
http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html

How can I use the FieldCache with a binary field?

--
Kind regards,
Mathias



RE: Using Solr Analyzers in Lucene

2010-10-05 Thread Mathias Walter
Hi Max,

why don't you use WordDelimiterFilterFactory directly? I'm doing the same
stuff inside my own analyzer:

// same options as in schema.xml, passed programmatically to the factory
final Map<String, String> args = new HashMap<String, String>();

args.put("generateWordParts", "1");
args.put("generateNumberParts", "1");
args.put("catenateWords", "0");
args.put("catenateNumbers", "0");
args.put("catenateAll", "0");
args.put("splitOnCaseChange", "1");
args.put("splitOnNumerics", "1");
args.put("preserveOriginal", "1");
args.put("stemEnglishPossessive", "0");
args.put("language", "English");

WordDelimiterFilterFactory wordDelimiter = new WordDelimiterFilterFactory();
wordDelimiter.init(args);
stream = wordDelimiter.create(stream); // wraps the incoming TokenStream

--
Kind regards,
Mathias

> -Original Message-
> From: Max Lynch [mailto:ihas...@gmail.com]
> Sent: Tuesday, October 05, 2010 1:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solr Analyzers in Lucene
> 
> I have made progress on this by writing my own Analyzer.  I basically added
> the TokenFilters that are under each of the solr factory classes.  I had to
> copy and paste the WordDelimiterFilter because, of course, it was package
> protected.
> 
> 
> 
> On Mon, Oct 4, 2010 at 3:05 PM, Max Lynch  wrote:
> 
> > Hi,
> > I asked this question a month ago on lucene-user and was referred here.
> >
> > I have content being analyzed in Solr using these tokenizers and filters:
> >
> > <fieldType ... class="solr.TextField" positionIncrementGap="100">
> >   <analyzer ...>
> >     <tokenizer .../>
> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> >             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >             catenateAll="0" splitOnCaseChange="1"/>
> >     <filter .../>
> >     <filter ... protected="protwords.txt"/>
> >   </analyzer>
> >   <analyzer ...>
> >     <tokenizer .../>
> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> >             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >             catenateAll="0" splitOnCaseChange="1"/>
> >     <filter .../>
> >     <filter ... protected="protwords.txt"/>
> >   </analyzer>
> > </fieldType>
> >
> > Basically I want to be able to search against this index in Lucene with one
> > of my background searching applications.
> >
> > My main reason for using Lucene over Solr for this is that I use the
> > highlighter to keep track of exactly which terms were found, which I use
> > for my own scoring system, and I always collect the whole set of found
> > documents.  I've messed around with using Boosts but it wasn't fine-grained
> > enough and I wasn't able to effectively create a score threshold (would
> > creating my own scorer be a better idea?)
> >
> > Is it possible to use this analyzer from Lucene, or at least re-create it
> > in code?
> >
> > Thanks.
> >
> >



AW: WordDelimiterFilter combined with PositionFilter

2010-09-29 Thread Mathias Walter
Hi Robert,

> On Fri, Sep 24, 2010 at 3:54 AM, Mathias Walter wrote:
> 
> > Hi,
> >
> > I combined the WordDelimiterFilter with the PositionFilter to prevent the
> > creation of expensive PhraseQueries and MultiPhraseQueries. But if I now
> > parse an escaped string consisting of two terms, the analyser returns a
> > BooleanQuery. That's not what I would expect. If a string is escaped, I
> > would expect a PhraseQuery and not a BooleanQuery.
> >
> > What should be the correct behavior?
> >
> >
> instead of PositionFilter, you can upgrade to either trunk or branch_3x from
> svn, and use:
> 
> <fieldType ... autoGeneratePhraseQueries="false">
> 
> then you will get phrase queries when the user asked for them, but not
> automatically.

Are term vector positions still correctly computed if positionIncrementGap is 
used?

--
Kind regards,
Mathias



WordDelimiterFilter combined with PositionFilter

2010-09-24 Thread Mathias Walter
Hi,

I combined the WordDelimiterFilter with the PositionFilter to prevent the
creation of expensive PhraseQueries and MultiPhraseQueries. But if I now parse
an escaped string consisting of two terms, the analyser returns a BooleanQuery.
That's not what I would expect. If a string is escaped, I would expect a
PhraseQuery and not a BooleanQuery.

What should be the correct behavior?

--
Kind regards,
Mathias