[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-10 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

just updating patch to trunk, the nocommits remain...

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, synonyms.zip


 The current synonyms filter uses a lot of RAM and CPU, especially at build 
 time.
 I think yesterday I heard about huge synonyms files three times.
 So, I think we should use an FST-based structure, sharing the inputs and 
 outputs.
 And we should be more efficient with the TokenStream API, e.g. using 
 save/restoreState instead of cloneAttributes().

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

New patch, also setting offsets in the produced tokens (the wordnet test 
passes), and adding a NOTE about the dup output words issue.

I think it's finally ready to commit!

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

Updated patch: 
* renamed to SynonymFilter
* added not-so-sophisticated backwards layer
* added tests
* added parser for format=wordnet
* removed contrib/wordnet

But I found some bugs (well, one surely is) in the new tests, so I added 
nocommits here.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

New patch, moving the root arcs cache into FST, not using up our last precious 
arc bit.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

Another rev of the patch: I did a hard bump of the FST version (so
existing trunk indices must be rebuilt), and added a NOTE in suggest's
FST impl that the file format is experimental; removed
maxVerticalContext; fixed a false test failure.


 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

Fixed some bugs and added some tests, but there is a problem: I started to add a 
little benchmark and I hit this on my largish synonyms file:
{noformat}
java.lang.IllegalStateException: max arc size is too large (445)
{noformat}

Just run TestFSTSynonymFilterFactory and you will see it. I enabled some 
prints and it doesn't appear that anything totally stupid is going on... giving 
up for the night :)

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: synonyms.zip

Attaching my synonyms.txt test file that I was using; it's derived from WordNet.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

Here is a patch with a little microbenchmark... so we have some tuning to do. 

The benchmark analyzes a short string a million times, one that doesn't match any 
synonyms (actually the Solr default).

||impl||ms||
|SynonymsFilter|1692|
|FST with array arcs|2794|
|FST with no array arcs|8823|

So disabling the array arcs is a pretty crucial hit here. But we have 
other options to speed up this common case, e.g. with Daciuk we could build a 
CharacterRunAutomaton of the K-prefixes of the synonyms, which would be really fast 
at rejecting terms that don't match any syns.

Or we could explicitly put our BytesRef output in a byte[], and use long 
pointers as outputs.

Or we could speed up FST! But I think it's interesting to see how important this 
parameter is.
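
A minimal sketch of the K-prefix rejection idea, in plain Java. It is not the Daciuk/CharacterRunAutomaton construction mentioned above: a HashSet of prefixes stands in for the automaton, and the names PrefixRejector and mightMatch are hypothetical, made up for illustration. The point is only that a token whose first K characters start no synonym input can be rejected before the FST is touched.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical pre-filter: cheaply rejects tokens that cannot start any synonym input. */
final class PrefixRejector {
    private final int k;
    // All prefixes of length 1..k of every synonym input (a stand-in for an automaton).
    private final Set<String> prefixes = new HashSet<>();

    PrefixRejector(List<String> synonymInputs, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        this.k = k;
        for (String input : synonymInputs) {
            int max = Math.min(k, input.length());
            for (int len = 1; len <= max; len++) {
                prefixes.add(input.substring(0, len));
            }
        }
    }

    /**
     * Returns false only when no synonym input begins with this token.
     * A true result is just "maybe": the full lookup still has to run.
     */
    boolean mightMatch(String token) {
        if (token.isEmpty()) {
            return false;
        }
        String prefix = token.substring(0, Math.min(k, token.length()));
        return prefixes.contains(prefix);
    }
}
```

With K small the set stays tiny, so the common no-match case costs one hash lookup; false positives are allowed, false negatives are not.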

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-06 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

New patch, including some optimizing to FST (which we can commit under a 
separate issue): array arcs can now be any size, and I re-use the BytesReader 
inner class that's created for parsing arcs.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

Updated patch; this tableizes the first FST arcs for Latin-1.

Precomputing this tiny table speeds up this filter a ton (from ~3000ms to ~2000ms) 
and I think it is a cheap, easy win for the terms index too.
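
A rough sketch of what "tableizing the first arcs" could look like, using a toy trie rather than Lucene's FST. The class RootArcCache and its methods are hypothetical names for illustration only. The root's outgoing arcs with labels below 256 (Latin-1) are copied into a flat array, so the first step of every lookup is a single array index.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy trie with a precomputed Latin-1 table for the root's arcs (illustrative only). */
final class RootArcCache {
    static final class Node {
        final Map<Integer, Node> arcs = new HashMap<>();
        boolean isFinal;
    }

    private static final int TABLE_SIZE = 256; // code points 0..255, i.e. Latin-1
    private final Node root = new Node();
    private Node[] rootTable; // built lazily, dropped whenever the trie changes

    void add(String input) {
        Node node = root;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            node = node.arcs.computeIfAbsent(cp, unused -> new Node());
            i += Character.charCount(cp);
        }
        node.isFinal = true;
        rootTable = null; // invalidate the cached table
    }

    /** Follows the root arc for a label: array lookup for Latin-1, map lookup otherwise. */
    private Node rootArc(int label) {
        if (label >= TABLE_SIZE) {
            return root.arcs.get(label);
        }
        if (rootTable == null) {
            rootTable = new Node[TABLE_SIZE];
            for (Map.Entry<Integer, Node> e : root.arcs.entrySet()) {
                if (e.getKey() < TABLE_SIZE) {
                    rootTable[e.getKey()] = e.getValue();
                }
            }
        }
        return rootTable[label];
    }

    boolean accepts(String input) {
        if (input.isEmpty()) {
            return root.isFinal;
        }
        int first = input.codePointAt(0);
        Node node = rootArc(first);
        int i = Character.charCount(first);
        while (node != null && i < input.length()) {
            int cp = input.codePointAt(i);
            node = node.arcs.get(cp);
            i += Character.charCount(cp);
        }
        return node != null && node.isFinal;
    }
}
```

In this toy the inner maps are already constant-time, so it shows the shape of the optimization rather than a measurable speedup; in a structure whose arcs must be scanned or decoded, skipping that work at the root is where the win would come from.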

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, synonyms.zip





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-05 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

Patch with a first random test. This one currently does 10 iterations where it 
adds random shit to the synonym map, then it analyzes 10k random strings (each 
time capturing the output, and replaying it back to ensure the thing is 
deterministic and doesn't have reuse bugs).

I also added the ignoreCase support.

The filter might have a reuse bug; see:
ant test -Dtestcase=TestFSTSynonymMapFilter -Dtestmethod=testRandom -Dtests.seed=-4122723628721952592:244824441557739968
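
The capture-and-replay check described above can be sketched in plain Java as follows. This is not the actual test from the patch: ReplayCheck, randomText and firstNonDeterministicInput are hypothetical names, and a Function from String to a token list stands in for the analysis chain. Each random input is analyzed twice with the same (reused) analyzer, and the two outputs must be equal; a fixed seed makes any failure reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

/** Illustrative capture/replay harness for catching state that leaks between reuses. */
final class ReplayCheck {
    /** Builds a short random string over a tiny alphabet so that repeats are likely. */
    static String randomText(Random random) {
        int length = random.nextInt(20);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append("ab c".charAt(random.nextInt(4)));
        }
        return sb.toString();
    }

    /**
     * Analyzes each random input twice and compares the results.
     * Returns the first input whose two runs differ, or null if all runs agree.
     */
    static String firstNonDeterministicInput(
            Function<String, List<String>> analyzer, long seed, int iterations) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            String text = randomText(random);
            // Copy the first result: a buggy analyzer may hand back a shared, mutable list.
            List<String> captured = new ArrayList<>(analyzer.apply(text));
            List<String> replayed = analyzer.apply(text);
            if (!captured.equals(replayed)) {
                return text;
            }
        }
        return null;
    }
}
```

Returning the offending input (rather than a boolean) mirrors the value of printing the seed: the failure can be replayed in isolation.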


 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

New patch, folding in Robert's changes and the random stress test.  All tests 
pass.  I think it's now functionally correct, but I still need to compare perf 
vs existing syn filter, and there are still a few minor nocommits to work out.

Ideally, if we get perf close enough, since RAM is much much less w/ this new 
syn filter, I think we should replace the old one with this new one.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

New patch, adding dedup option to the builder, removing a couple nocommits, 
cutting back on iters/counts in testRandom2.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-05 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

Updated patch:
* added a SolrSynonymsParser and test to the analyzers module, that parses the 
existing solr synonyms format.
* added a Solr factory for this thing (untested!) that uses this when 
format=solr (the default)

This way, the idea is the factory would be more extensible, e.g. you could load 
syns from a database, or we could add parsers for wordnet and nuke 
contrib/wordnet, etc etc.

Still need to do some basic benchmarking.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
 LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3233.patch

New patch w/ current state.

I think it's closer; the test has more cases now (but I'd still like to make a 
random test), fewer nocommits, etc.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-06-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3233:
---

Attachment: LUCENE-3223.patch

Dumping my current state on FSTSynonymFilter -- it compiles but it's got tons 
of bugs I'm sure!  I added a trivial initial test.

 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3223.patch, LUCENE-3233.patch





[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-06-23 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3233:


Attachment: LUCENE-3233.patch

here's a rough start to building a datastructure that I think makes good 
tradeoffs between RAM and processing.

No matter what, the processing on the filter-side will be hairy because of the 
'interleaving' with the tokenstream.

This one is just an FST<CharsRef,Int[]>(BYTE4) where Int is an ord to a 
BytesRefHash, containing the output Bytes for each term.

This way, at input time we can walk the FST with codePointAt()

On both sides, the Chars/Bytes are actually phrases, using \u as a word 
separator.
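
To make the layout above concrete, here is a small stand-in written with ordinary collections. It is only a sketch under stated assumptions: a HashMap plays the role of the FST, a list plus a map plays the role of the BytesRefHash, and SynonymTable, join, add and lookup are hypothetical names. The separator character is truncated in the message above; U+0000 is used here purely as an assumption. Each input phrase maps to an int[] of ords, and each distinct output phrase is stored once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative input-phrase -> ords -> shared-output-phrase table (not Lucene code). */
final class SynonymTable {
    // Assumption: U+0000 as the word separator inside a phrase.
    static final char WORD_SEPARATOR = '\u0000';

    private final Map<String, int[]> inputToOrds = new HashMap<>();   // stand-in for the FST
    private final Map<String, Integer> outputToOrd = new HashMap<>(); // stand-in for the hash
    private final List<String> outputs = new ArrayList<>();           // ord -> output phrase

    /** Joins words into one phrase key using the separator. */
    static String join(String... words) {
        return String.join(String.valueOf(WORD_SEPARATOR), words);
    }

    /** Returns the ord of an output phrase, assigning a new one on first sight. */
    private int ordFor(String output) {
        Integer ord = outputToOrd.get(output);
        if (ord == null) {
            ord = outputs.size();
            outputs.add(output);
            outputToOrd.put(output, ord);
        }
        return ord;
    }

    void add(String input, String output) {
        int ord = ordFor(output);
        int[] old = inputToOrds.get(input);
        if (old == null) {
            inputToOrds.put(input, new int[] {ord});
            return;
        }
        for (int existing : old) {
            if (existing == ord) {
                return; // this input already maps to that output
            }
        }
        int[] grown = Arrays.copyOf(old, old.length + 1);
        grown[old.length] = ord;
        inputToOrds.put(input, grown);
    }

    List<String> lookup(String input) {
        List<String> result = new ArrayList<>();
        int[] ords = inputToOrds.get(input);
        if (ords != null) {
            for (int ord : ords) {
                result.add(outputs.get(ord));
            }
        }
        return result;
    }

    int uniqueOutputCount() {
        return outputs.size();
    }
}
```

Storing ords instead of the output phrases themselves is what lets many inputs share one copy of a common output.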


 HuperDuperSynonymsFilter™
 -

 Key: LUCENE-3233
 URL: https://issues.apache.org/jira/browse/LUCENE-3233
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-3233.patch

