RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread Steven A Rowe
Hi Phani,

Assuming you're using Lucene 3.6.X, see:

http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/core/src/java/org/apache/lucene/analysis/standard/READ_BEFORE_REGENERATING.txt
 

and

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_6/lucene/common-build.xml?revision=1364130&view=markup#l356

I've pasted the relevant contents below:


WARNING: if you change StandardTokenizerImpl*.jflex or UAX29URLEmailTokenizer
and need to regenerate the tokenizer, only use the trunk version
of JFlex 1.5 (with a minimum SVN revision 597) at the moment!

Please install the jFlex 1.5 version (currently not released)
from its SVN repository:

 svn co http://jflex.svn.sourceforge.net/svnroot/jflex/trunk jflex
 cd jflex
 mvn install

Then, create a build.properties file either in your home
directory, or within the Lucene directory, and set the jflex.home
property to the path where the JFlex trunk checkout is located
(in the above example it's the directory called "jflex").
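
For example, a minimal build.properties could contain just a single line (the path below is only a placeholder for wherever you checked out the JFlex trunk):

 jflex.home=/path/to/jflex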


Steve

-Original Message-
From: vempap [mailto:phani.vemp...@emc.com] 
Sent: Thursday, October 04, 2012 7:43 PM
To: d...@lucene.apache.org
Subject: StandardTokenizer generation from JFlex grammar

Hello,

  I'm trying to regenerate the standard tokenizer using the JFlex
specification (StandardTokenizerImpl.jflex), but I'm not able to do so due to
some errors. (I would like to create my own JFlex file based on the standard
tokenizer, which is why I'm first trying to regenerate it, to get the hang
of things.)

I'm using jflex 1.4.3 and I ran into the following error:

Error in file filename (line 64): 
Syntax error.
HangulEx   = (!(!\p{Script:Hangul}|!\p{WB:ALetter})) ({Format} |
{Extend})*


Also, I tried installing an Eclipse plugin from
http://cup-lex-eclipse.sourceforge.net/, which I thought would provide
options similar to JavaCC (http://eclipse-javacc.sourceforge.net/) through
which we can generate classes within Eclipse - but had no luck.

Any help would be much appreciated.

Regards,
Phani.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/StandardTokenizer-generation-from-JFlex-grammar-tp4011939.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Using stop words with snowball analyzer and shingle filter

2012-09-19 Thread Steven A Rowe
Hi Martin,

SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in Lucene 
5.0.

Looks like you're using Lucene 3.x; here's an (untested) Analyzer, based on
the Lucene 3.6 EnglishAnalyzer (except substituting SnowballFilter for
PorterStemmer, disabling stopword holes' position increments, and adding a
ShingleFilter), that should basically do what you want:

--
final Version matchVersion = Version.LUCENE_36;
String[] stopWords = new String[] { ... };
final Set<?> stopSet = StopFilter.makeStopSet(matchVersion, stopWords);
String[] stemExclusions = new String[] { ... };
final Set<String> stemExclusionsSet = new HashSet<String>();
stemExclusionsSet.addAll(Arrays.asList(stemExclusions));

Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to 3.1 we get the classic behavior; StandardFilter does it for us
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopSet);
    ((StopFilter)result).setEnablePositionIncrements(false);  // disable holes' position increments
    if (stemExclusionsSet.size() > 0) {
      result = new KeywordMarkerFilter(result, stemExclusionsSet);
    }
    result = new SnowballFilter(result, "English");
    result = new ShingleFilter(result, getnGramLength());  // your ngram length
    return new TokenStreamComponents(source, result);
  }
};
--
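
The analyzer above can then be used as usual; e.g. (untested fragment, index path is a placeholder):

IndexWriterConfig config = new IndexWriterConfig(matchVersion, analyzer);
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), config);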

Steve

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, September 19, 2012 7:16 PM
To: java-user@lucene.apache.org
Subject: Re: Using stop words with snowball analyzer and shingle filter

The underscores are due to the fact that the StopFilter defaults to enable 
position increments, so there are no terms at the positions where the stop 
words appeared in the source text.

Unfortunately, SnowballAnalyzer does not pass that in as a parameter and is 
final so you can't subclass it to override the createComponents method 
that creates the StopFilter, so you would essentially have to copy the 
source for SnowballAnalyzer and then add in the code to invoke 
StopFilter.setEnablePositionIncrements the way StopFilterFactory does.

-- Jack Krupansky

-Original Message- 
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:



snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English",
stopWords);

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer,
this.getnGramLength());



stopWords is set to either a full list of words to include in ngrams or to
remove from them. this.getnGramLength() simply returns the current ngram
length, up to a maximum of three.



If I use stopwords in filtering the text "satellite is definitely falling to
Earth" for trigrams, the output is:



No=1, Key=to, Freq=1

No=2, Key=definitely, Freq=1

No=3, Key=falling to earth, Freq=1

No=4, Key=satellite, Freq=1

No=5, Key=is, Freq=1

No=6, Key=definitely falling to, Freq=1

No=7, Key=definitely falling, Freq=1

No=8, Key=falling, Freq=1

No=9, Key=to earth, Freq=1

No=10, Key=satellite is, Freq=1

No=11, Key=is definitely, Freq=1

No=12, Key=falling to, Freq=1

No=13, Key=is definitely falling, Freq=1

No=14, Key=earth, Freq=1

No=15, Key=satellite is definitely, Freq=1



But if I don't use stopwords for trigrams, the output is this:



No=1, Key=satellite, Freq=1

No=2, Key=falling _, Freq=1

No=3, Key=satellite _ _, Freq=1

No=4, Key=_ earth, Freq=1

No=5, Key=falling, Freq=1

No=6, Key=satellite _, Freq=1

No=7, Key=_ _, Freq=1

No=8, Key=_ falling _, Freq=1

No=9, Key=falling _ earth, Freq=1

No=10, Key=_, Freq=3

No=11, Key=earth, Freq=1

No=12, Key=_ _ falling, Freq=1

No=13, Key=_ falling, Freq=1



Why am I seeing underscores? I would have expected to see simple unigrams,
plus "satellite falling", "falling earth", and "satellite falling earth".








-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: ReferenceManager.maybeRefreshBlocking() should not be declared throwing InterruptedException

2012-07-21 Thread Steven A Rowe
Hi Vitaly,

Info here should help you set up snapshot dependencies:

http://wiki.apache.org/lucene-java/NightlyBuilds

Steve

-Original Message-
From: Vitaly Funstein [mailto:vfunst...@gmail.com] 
Sent: Saturday, July 21, 2012 9:22 PM
To: java-user@lucene.apache.org
Subject: Re: ReferenceManager.maybeRefreshBlocking() should not be declared 
throwing InterruptedException

Yeah, this is in 4.0-ALPHA. What should I update my Maven dependency to in
order to get the latest snapshots instead - if that's where the fix is?

On Sat, Jul 21, 2012 at 4:16 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Thanks Vitaly.

 I think you are looking at an older 4.x/5.x version?  We recently
 removed declaration of this (unchecked) exception... (LUCENE-4172).

 Mike McCandless

 http://blog.mikemccandless.com

 On Fri, Jul 20, 2012 at 11:26 PM, Vitaly Funstein vfunst...@gmail.com
 wrote:
  This probably belongs in the JIRA, and is related to
  https://issues.apache.org/jira/browse/LUCENE-4025, but
  java.util.concurrent.locks.Lock.lock() doesn't throw anything. I believe the author of the
  change originally meant to use lockInterruptibly() inside but forgot to
  adjust the method sig after changing it back to lock().

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: RAMDirectory and expungeDeletes()/optimize()

2012-07-11 Thread Steven A Rowe
Nabble silently drops content from email sent through their interface on a 
regular basis.  I've told them about it multiple times.  My suggestion: find 
another way to post to this mailing list.

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, July 11, 2012 10:07 AM
To: java-user@lucene.apache.org
Subject: Re: RAMDirectory and expungeDeletes()/optimize()

What I meant was your original email says "My code looks like:",
followed by blank lines, and then "Doesn't it conflict with the
JavaDoc saying:", followed by blank lines. I.e., we can't see your code.

However, when I look at your email here at
http://lucene.472066.n3.nabble.com/RAMDirectory-and-expungeDeletes-optimize-td3994350.html#a3994387
I do see the code and javadocs.

But when I look at http://lucene.markmail.org/thread/z5gcms6lp4bo5hfs
and 
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201207.mbox/%3c1342001903207-3994350.p...@n3.nabble.com%3e
they are missing.

Not sure what's going on.  Maybe your email was originally HTML but
got converted to plain text somewhere along the way, losing those
important parts?

Anyway, to try to answer your question: you should be able to simply
call optimize (forceMerge(1)): it does what expungeDeletes does, and
more (merges down to 1 segment).  Yes, it's horribly costly, and so
you should do it rarely, but it sounds like it may be OK in this case
(one time thing before you send a segment off to the main index).
Still, you should test whether it actually helps in the end, because
likely the main index will have to merge these segments anyway (if
enough are added) which'd mean the merging you did on adding them was
redundant (unless bandwidth is very costly...).

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 11, 2012 at 9:55 AM, Konstantyn Smirnov inject...@yahoo.com wrote:
 JavaDoc comes from here
 http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#expungeDeletes()

 other blanks are here because it's Groovy :) Or what did you mean exactly?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/RAMDirectory-and-expungeDeletes-optimize-tp3994350p3994387.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: how to remove the dash

2012-06-25 Thread Steven A Rowe
I added the following to both TestStandardAnalyzer and TestClassicAnalyzer in 
branches/lucene_solr_3_6/, and it passed in both cases:

  public void testWhitespaceHyphenWhitespace() throws Exception {
    BaseTokenStreamTestCase.assertAnalyzesTo
      (a, "drinks - water", new String[]{"drinks", "water"});
  }

So I'm not seeing the same behavior as you guys - the hyphen is not part of any 
emitted token.

Steve

-Original Message-
From: lis...@alphamatrix.org [mailto:lis...@alphamatrix.org] 
Sent: Monday, June 25, 2012 11:33 AM
To: java-user@lucene.apache.org
Subject: Re: how to remove the dash

On Monday, 25 June 2012 at 16:10:38, Ian Lea wrote:
 My apologies - you are right.
 
 With both ClassicAnalyzer and StandardAnalyzer, "drinks - water" comes
 out as "drinks -water" whereas "drinks-water" comes out as "drinks
 water", as I'd expected.
 
 I guess this is fixable in JFlex, or I think there is some replace
 tokenizer somewhere that can replace character X with character Y,
 e.g. "-" with " ".  Or pre-process your text/queries with a regexp.  Maybe
 someone else has better ideas.

I guess the same... I'm already using my own Tokenizer (based on
StandardTokenizer) to mark some strings for replacement or removal, and I'm
using a filter to replace them and another filter to remove them... I tried to
do that with the "-" but it didn't work... I can't even mark the "-".
I'm avoiding pre-processing...
I'm hoping that somebody could tell me what I can change in the StandardTokenizer
JFlex grammar to change this behavior.

Thanks

 
 
 --
 Ian.




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: [MAVEN] Heads up: build changes

2012-05-09 Thread Steven A Rowe
Cool, thanks for reporting back.

-Original Message-
From: Greg Bowyer [mailto:gbow...@fastmail.co.uk] 
Sent: Wednesday, May 09, 2012 1:54 PM
To: java-user@lucene.apache.org
Subject: Re: [MAVEN] Heads up: build changes

Sorry, this was my fault; I found that my BSF jars were broken in my Ant install.

On 08/05/12 14:32, Greg Bowyer wrote:
 greg@localhost ~ $ java -version
 java version "1.7.0_04"
 Java(TM) SE Runtime Environment (build 1.7.0_04-b20)
 Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)

 greg@localhost ~ $ uname -a
 Linux localhost 2.6.39 #4 SMP Sun Aug 21 13:53:29 PDT 2011 x86_64
 Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz GenuineIntel GNU/Linux


 On 08/05/12 11:24, Steven A Rowe wrote:
 Hi Greg,

 I don't see that problem - 'ant generate-maven-artifacts' just works for me.

 I suspect that the XSLT processor included with your JDK does not support 
 the EXSLT str:split functionality used in the lucene/site/xsl/index.xsl 
 stylesheet, which is invoked from the 'process-webpages' target.  What 
 JDK/version/vendor/platform are you using?

 Steve

 -Original Message-
 From: Greg Bowyer [mailto:gbow...@fastmail.co.uk]
 Sent: Tuesday, May 08, 2012 4:54 PM
 To: java-user@lucene.apache.org
 Subject: Re: [MAVEN] Heads up: build changes

For me 'ant generate-maven-artifacts' is giving me this error, any thoughts?

 -- %   --

 process-webpages:
 [xslt] Processing 
 /home/greg/projects/lucene-solr/lucene/build.xml
 to /home/greg/projects/lucene-solr/lucene/build/docs/index.html
 [xslt] Loading stylesheet
 /home/greg/projects/lucene-solr/lucene/site/xsl/index.xsl
 [copy] Copying 2 files to
 /home/greg/projects/lucene-solr/lucene/build/docs
 [copy] Copying
 /home/greg/projects/lucene-solr/lucene/JRE_VERSION_MIGRATION.txt to 
 /home/greg/projects/lucene-solr/lucene/build/docs/JRE_VERSION_MIGRATION.html
 [copy] 08-May-2012 12:08:35 org.apache.bsf.BSFManager 
 loadScriptingEngine
 [copy] SEVERE: Exception :
 [copy] java.lang.ClassNotFoundException:
 org.apache.bsf.engines.javascript.JavaScriptEngine
 [copy] at
 org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1361)
 [copy] at
 org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1311)
 [copy] at
 org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1064)
 [copy] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 [copy] at
 org.apache.bsf.BSFManager.loadScriptingEngine(BSFManager.java:693)
 [copy] at org.apache.bsf.BSFManager.exec(BSFManager.java:485)
 [copy] at
 org.apache.tools.ant.util.optional.ScriptRunner.executeScript(ScriptRunner.java:100)
 [copy] at
 org.apache.tools.ant.types.optional.ScriptFilter.filter(ScriptFilter.java:110)
 [copy] at
 org.apache.tools.ant.filters.TokenFilter.read(TokenFilter.java:114)
 [copy] at
 org.apache.tools.ant.filters.BaseFilterReader.read(BaseFilterReader.java:83)
 [copy] at java.io.BufferedReader.read1(BufferedReader.java:185)
 [copy] at java.io.BufferedReader.read(BufferedReader.java:261)
 [copy] at
 org.apache.tools.ant.util.ResourceUtils.copyResource(ResourceUtils.java:494)
 [copy] at
 org.apache.tools.ant.util.FileUtils.copyFile(FileUtils.java:559)
 [copy] at
 org.apache.tools.ant.taskdefs.Copy.doFileOperations(Copy.java:875)
 [copy] at
 org.apache.tools.ant.taskdefs.Copy.execute(Copy.java:549)
 [copy] at
 org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
 [copy] at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown
 Source)
 [copy] at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 [copy] at java.lang.reflect.Method.invoke(Method.java:597)
 [copy] at
 org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
 [copy] at org.apache.tools.ant.Task.perform(Task.java:348)
 [copy] at
 org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:68)
 [copy] at
 org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
 [copy] at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown
 Source)
 [copy] at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 [copy] at java.lang.reflect.Method.invoke(Method.java:597)
 [copy] at
 org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
 [copy] at org.apache.tools.ant.Task.perform(Task.java:348)
 [copy] at
 org.apache.tools.ant.taskdefs.MacroInstance.execute(MacroInstance.java:398)
 [copy] at
 org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
 [copy] at sun.reflect.GeneratedMethodAccessor4.invoke

[MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
If you use the Lucene/Solr Maven POMs to drive the build, I committed a major 
change last night (see https://issues.apache.org/jira/browse/LUCENE-3948 for 
more details):

* 'ant get-maven-poms' no longer places pom.xml files under the lucene/ and 
solr/ directories.  Instead, they are placed in a new top-level directory: 
maven-build/.

* When you run 'mvn <whatever>' under maven-build/, build and test output now 
goes under the conventional Maven target/ directories associated with each 
module's POM under the top-level maven-build/ directory.  Maven build and test 
outputs are now completely separate from those produced by the Ant build.

The above changes don't affect the 'ant generate-maven-artifacts' process - the 
top-level maven-build/ directory is not involved.  (Instead, the 
'generate-maven-artifacts' target calls a separate target - 
'filter-pom-templates' - to copy the POMs to lucene/build/poms/ and interpolate 
their versions.)

Please let me know if you run into problems with the new setup.

Thanks,
Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: [MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
)
  [copy] at 
org.apache.tools.ant.taskdefs.SubAnt.execute(SubAnt.java:302)
  [copy] at 
org.apache.tools.ant.taskdefs.SubAnt.execute(SubAnt.java:221)
  [copy] at 
org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
  [copy] at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown 
Source)
  [copy] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  [copy] at java.lang.reflect.Method.invoke(Method.java:597)
  [copy] at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
  [copy] at org.apache.tools.ant.Task.perform(Task.java:348)
  [copy] at 
org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:68)
  [copy] at 
org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
  [copy] at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown 
Source)
  [copy] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  [copy] at java.lang.reflect.Method.invoke(Method.java:597)
  [copy] at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
  [copy] at org.apache.tools.ant.Task.perform(Task.java:348)
  [copy] at org.apache.tools.ant.Target.execute(Target.java:390)
  [copy] at 
org.apache.tools.ant.Target.performTasks(Target.java:411)
  [copy] at 
org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
  [copy] at 
org.apache.tools.ant.Project.executeTarget(Project.java:1368)
  [copy] at 
org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
  [copy] at 
org.apache.tools.ant.Project.executeTargets(Project.java:1251)
  [copy] at org.apache.tools.ant.Main.runBuild(Main.java:809)
  [copy] at org.apache.tools.ant.Main.startAnt(Main.java:217)
  [copy] at 
org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
  [copy] at 
org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)



On 08/05/12 10:31, Steven A Rowe wrote:
 If you use the Lucene/Solr Maven POMs to drive the build, I committed a major 
 change last night (see https://issues.apache.org/jira/browse/LUCENE-3948 for 
 more details):

 * 'ant get-maven-poms' no longer places pom.xml files under the lucene/ and 
 solr/ directories.  Instead, they are placed in a new top-level directory: 
 maven-build/.

 * When you run 'mvn <whatever>' under maven-build/, build and test output now 
 goes under the conventional Maven target/ directories associated with each 
 module's POM under the top-level maven-build/ directory.  Maven build and 
 test outputs are now completely separate from those produced by the Ant build.

 The above changes don't affect the 'ant generate-maven-artifacts' 
 process - the top-level maven-build/ directory is not involved.  
 (Instead, the 'generate-maven-artifacts' target calls a separate 
 target - 'filter-pom-templates' - to copy the POMs to 
 lucene/build/poms/ and interpolate their versions.)

 Please let me know if you run into problems with the new setup.

 Thanks,
 Steve


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Highlighter and Shingles...

2012-04-20 Thread Steven A Rowe
Hi Dawn,

Can you give an example of a partial match?

Steve

-Original Message-
From: Dawn Zoë Raison [mailto:d...@digitorial.co.uk] 
Sent: Friday, April 20, 2012 7:59 AM
To: java-user@lucene.apache.org
Subject: Highlighter and Shingles...

Hi,

Are there any notes on making the highlighter work consistently with a shingle 
generated index?
I have a situation where complete matches highlight OK, but partial matches do 
not - leading to a number of blank previews...

Our analyser look like:

 TokenStream result =
 new StopFilter(Version.LUCENE_36,
 new ShingleFilter(
 new StopFilter(Version.LUCENE_36,
 new LowerCaseFilter(Version.LUCENE_36,
 new StandardFilter(Version.LUCENE_36,
 new 
StandardTokenizer(Version.LUCENE_36, reader)
 )
 ),
 STOP_CHARS_SET)
 ),
 STOP_WORDS_SET);

-- 

Rgds.
*Dawn Raison*


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Two questions on RussianAnalyzer

2012-04-19 Thread Steven A Rowe
Hi Vladimir,

 The most uncomfortable thing in the new behaviour for me is that in the past
 I used to search by a subdomain like "bbb.com" and got results
 with "www.bbb.com", "aaa.bbb.com" and so on. Now I have 0
 results.

About domain names, see my response to a similar question today on the Solr 
users list: http://markmail.org/message/3ddxwc7dunblthyt. 

Steve



RE: Partial word match

2012-04-09 Thread Steven A Rowe
Hi Hanu,

Depending on the nature of the partial word match you're looking for - do you 
want to only match partial words that match at the beginning of the word? - you 
should look either at NGramTokenFilter or EdgeNGramTokenFilter:

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html
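
For instance, here is a rough (untested) sketch of an index-time analyzer using EdgeNGramTokenFilter, assuming Lucene 3.x with the contrib analyzers jar on the classpath; the version constant and gram sizes are placeholders:

Analyzer analyzer = new Analyzer() {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
    stream = new LowerCaseFilter(Version.LUCENE_35, stream);
    // "partial" is indexed as "pa", "par", "part", ... so word-prefix queries can match
    stream = new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 2, 10);
    return stream;
  }
};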

Steve

-Original Message-
From: hanu_bhambi [mailto:hanu.bha...@comviva.com] 
Sent: Monday, April 09, 2012 6:31 AM
To: java-user@lucene.apache.org
Subject: Partial word match

Is it possible to match partial words using Lucene? We are using
StandardAnalyzer for tokenization.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-word-match-tp3896450p3896450.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
Hi okayndc,

What *do* you want?

Steve

-Original Message-
From: okayndc [mailto:bodymo...@gmail.com] 
Sent: Thursday, April 05, 2012 1:34 PM
To: java-user@lucene.apache.org
Subject: HTML tags and Lucene highlighting

Hello,

I currently use Lucene version 3.0...probably need to upgrade to a more current 
version soon.
The problem that I have is when I test search for an HTML tag (e.g.
<strong>), Lucene returns
the highlighted HTML tag ~ which is what I DO NOT want.  Is there a way to 
filter HTML tags?
I have read up on HTMLStripCharFilter (packaged with Solr) and wondered if 
this is the way to go?

Any help will be greatly appreciated,
Thanks

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
okayndc,

A field configured to use HTMLStripCharFilter as part of its index-time 
analyzer will strip out HTML tags before index terms are created by the 
tokenizer, so HTML tags will not be put into the index.  As a result, queries 
for HTML tags cannot match the original documents' HTML tags (in the field 
configured to use HTMLStripCharFilter, anyway).

So HTMLStripCharFilter should do what you want.
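
For example, a minimal (untested) sketch of an index-time analyzer wired that way, assuming a Lucene/Solr 3.x setup with the Solr jar on the classpath for HTMLStripCharFilter:

Analyzer analyzer = new Analyzer() {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // strip HTML markup before tokenization; offsets still point into the original text
    CharStream stripped = new HTMLStripCharFilter(CharReader.get(reader));
    TokenStream stream = new StandardTokenizer(Version.LUCENE_36, stripped);
    stream = new StandardFilter(Version.LUCENE_36, stream);
    stream = new LowerCaseFilter(Version.LUCENE_36, stream);
    return stream;
  }
};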

Steve

From: okayndc [mailto:bodymo...@gmail.com]
Sent: Thursday, April 05, 2012 3:36 PM
To: Steven A Rowe
Cc: java-user@lucene.apache.org
Subject: Re: HTML tags and Lucene highlighting

Hello,

I want to ignore HTML tags within a search.  ~ I should not be able to search
for an HTML tag (e.g. <strong>) and get back the highlighted HTML tag (e.g.
<span class="highlighted"><strong></span>) in a result set.

Thanks

On Thu, Apr 5, 2012 at 3:24 PM, Steven A Rowe 
sar...@syr.edumailto:sar...@syr.edu wrote:
Hi okayndc,

What *do* you want?

Steve

-Original Message-
From: okayndc [mailto:bodymo...@gmail.commailto:bodymo...@gmail.com]
Sent: Thursday, April 05, 2012 1:34 PM
To: java-user@lucene.apache.orgmailto:java-user@lucene.apache.org
Subject: HTML tags and Lucene highlighting

Hello,

I currently use Lucene version 3.0...probably need to upgrade to a more current 
version soon.
The problem that I have is when I test search for an HTML tag (e.g.
<strong>), Lucene returns
the highlighted HTML tag ~ which is what I DO NOT want.  Is there a way to 
filter HTML tags?
I have read up on HTMLStripCharFilter (packaged with Solr) and wondered if 
this is the way to go?

Any help will be greatly appreciated,
Thanks
-
To unsubscribe, e-mail: 
java-user-unsubscr...@lucene.apache.orgmailto:java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: 
java-user-h...@lucene.apache.orgmailto:java-user-h...@lucene.apache.org



RE: Lucene tokenization

2012-03-27 Thread Steven A Rowe
Hi Nilesh,

Which version of Lucene are you using?  StandardTokenizer behavior changed in 
v3.1.

Steve

-Original Message-
From: Nilesh Vijaywargiay [mailto:nilesh.vi...@gmail.com] 
Sent: Tuesday, March 27, 2012 2:04 PM
To: java-user@lucene.apache.org
Subject: Lucene tokenization

I have a string "01a_b-_-c-d" which is tokenized as "01a_b c d"

and the string "a_b-_-c_d" which is tokenized as "a b c d"

Why is there a difference when there is a digit at the beginning? I am using
the standard unstemmed tokenizer.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Hi Ilya,

What analyzers are you using at index-time and query-time?

My guess is that you're using an analyzer that includes punctuation in the
tokens it emits, in which case your index will have things like "sentence." and
"sentence?" in it, so querying for "sentence" will not match.

Luke can tell you what's in your index: http://code.google.com/p/luke/

Steve

-Original Message-
From: Ilya Zavorin [mailto:izavo...@caci.com] 
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0 

I am writing a Lucene based indexing-search app and testing it using some
simple docs and queries. I have 3 simple docs that are shown at the bottom of
this email between pairs of ===s, and about a dozen terms.
One of them is "electricity". As you can see, it appears in all three docs.
However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or
Doc 3.

Why is this happening? 

Another query that appears in all three but is found in only some is "sentence". I
have a bunch of other queries that only appear in one of the three docs and
these are all found correctly.

Is this an indication that I have either set parameters incorrectly when
indexing or set up the queries incorrectly (or both)?

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an 
NFL career.
(US Presswire)
Inside the Baltimore Ravens' locker room the air is alive. Players argue about 
a bean-bag toss game they play after practices, then mock a teammate who has 
inexplicably decided to do an interview naked. Music thumps. Giant men laugh, 
and their laughter rattles off cinder block walls in the symphony of a football 
team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to 
leave. It's a strange image, loaded with contrasts. He doesn't belong here, not 
with these men, many of whom are almost 10 years younger than him. And yet he 
feels very much at home. He isn't the star on this team, which is two wins from 
the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive 
bowling ball of a man who in the spirit of running backs relishes the chance to 
run the ball 25 times a game. Williams is an afterthought, a backup who has 
carried the ball more than 12 times in only one game this season. Often he 
might have the ball in his hands on only four or five plays, and this is fine 
with him. In fact he prefers it. His body has absorbed enough beatings for one 
lifetime. Let someone else get the pain.

electricity


===

Doc 2:
===
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed 
every day by countless thousands around the globe and yet I have never heard 
even one remotely legitimate answer. How much wood would a woodchuck chuck if a 
woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a 
woodchuck could chuck if a woodchuck could chuck wood. Next you'll be wanting 
to know why she sells seashells by the seashore.

common term is electricity


===

Doc 3:
===
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the 
right to hold the earliest presidential primary, fending off bigger states that 
claimed that the small New England state was too white to represent the 
nation's diverse population. Sentence is here.
In its defense, New Hampshire jokingly brags that its voters won't pick a 
presidential candidate until they've met at least three times face-to-face _ 
rather than seeing the person in television ads or at large events typical of 
bigger states. New Hampshire voters expect to shake hands with candidates at 
coffees that supporters host in their homes or at backyard barbecues.
That tradition paid off in 1976 for a little-known peanut farmer and former 
Georgia governor. Jimmy Carter won in New Hampshire and went on to become 
president.

word Hampshire by itself

this state has electricity

This is a state in the United states of America. Here is one term: United 
America. And Here's another one: States america. And here's yet another == 
UNITED STATES! Here we are dropping the middle stopword: United States  
America. Finally, we get one word: united. Then the second one: STates. 
Then the final one: America.

===


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For 

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Ilya,

StandardAnalyzer treats all forms of newline as whitespace, and doesn't join
tokens across whitespace.  Can you look at your original text using a hex
editor (or something like it, e.g. Unix "od")?  Check which character is
actually in between "electricity" and "this", and "pain." and "electricity" in
the original text.

Are you sure that these files were analyzed with StandardAnalyzer, and not some 
other language-specific analyzer, as a result of language misidentification?

Steve

-Original Message-
From: Ilya Zavorin [mailto:izavo...@caci.com] 
Sent: Monday, March 26, 2012 11:21 AM
To: java-user@lucene.apache.org
Subject: RE: can't find common words -- using Lucene 3.4.0 

Steve,

Thanks much for the link: very useful!

I looked at the index and found that it contains terms like

"electricitythis" -- from Doc 3
"pain.electricity" -- from Doc 1

"sentence.he" -- from Doc 1

It appears that there is some sort of issue with handling end-of-lines. What do 
I need to change at index time for this to work properly?


Not sure whether this is relevant, but the text files have been saved as UTF-8
even though they are ASCII. I need to handle foreign text so I assume all files
that I index are UTF-8.

I am using the standard analyzer for English text and other contributed 
analyzers for respective foreign texts


Thanks,

Ilya


-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: Monday, March 26, 2012 10:59 AM
To: java-user@lucene.apache.org
Subject: RE: can't find common words -- using Lucene 3.4.0 

Hi Ilya,

What analyzers are you using at index-time and query-time?

My guess is that you're using an analyzer that includes punctuation in the 
tokens it emits, in which case your index will have things like sentence. and 
sentence? in it, so querying for sentence will not match.

Luke can tell you what's in your index: http://code.google.com/p/luke/

Steve

-Original Message-
From: Ilya Zavorin [mailto:izavo...@caci.com] 
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0 

I am writing a Lucene based indexing-search app and testing it using some
simple docs and queries. I have 3 simple docs that are shown at the bottom of
this email between pairs of ===s, and about a dozen terms.
One of them is "electricity". As you can see, it appears in all three docs.
However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or
Doc 3.

Why is this happening? 

Another query that appears in all three but is found in only some is "sentence". I
have a bunch of other queries that only appear in one of the three docs and
these are all found correctly.

Is this an indication that I have either set parameters incorrectly when
indexing or set up the queries incorrectly (or both)?

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an 
NFL career.
(US Presswire)
Inside the Baltimore Ravens' locker room the air is alive. Players argue about 
a bean-bag toss game they play after practices, then mock a teammate who has 
inexplicably decided to do an interview naked. Music thumps. Giant men laugh, 
and their laughter rattles off cinder block walls in the symphony of a football 
team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to 
leave. It's a strange image, loaded with contrasts. He doesn't belong here, not 
with these men, many of whom are almost 10 years younger than him. And yet he 
feels very much at home. He isn't the star on this team, which is two wins from 
the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive 
bowling ball of a man who in the spirit of running backs relishes the chance to 
run the ball 25 times a game. Williams is an afterthought, a backup who has 
carried the ball more than 12 times in only one game this season. Often he 
might have the ball in his hands on only four or five plays, and this is fine 
with him. In fact he prefers it. His body has absorbed enough beatings for one 
lifetime. Let someone else get the pain.

electricity


===

Doc 2:
===
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed 
every day by countless thousands around the globe and yet I have never heard 
even one remotely legitimate answer. How much wood would a woodchuck chuck if a 
woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a 
woodchuck could chuck

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
 I am not seeing anything suspicious. Here's what I see in the HEX:

 "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
 "e.H" from "sentence.He": 65-2E-0D-0A-48

I agree, standard DOS/Windows line endings.

 I am pretty sure I am using the std analyzer

Interesting.  I'm quite sure something else is going on besides 
StandardAnalyzer, since StandardAnalyzer (more specifically, StandardTokenizer) 
always breaks tokens on whitespace, and excludes punctuation at the end of 
tokens.  In case you're interested, the standard to which StandardTokenizer 
(v3.1 - v3.5) conforms is the Word Boundaries rules from Unicode 6.0.0 standard 
annex #29 aka UAX#29: 
http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries.

Can you share the code where you construct your analyzer and IndexWriterConfig?

 Here's how I add a doc to the index (oc is String containing the whole 
 document):

 doc.add(new Field("contents",
   oc,
   Field.Store.YES,
   Field.Index.ANALYZED,
   Field.TermVector.WITH_POSITIONS_OFFSETS));

 Can this affect the indexing?

The way you add the Field looks fine.

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
https://builds.apache.org//job/Lucene-trunk/lastSuccessfulBuild/artifact/artifacts/changes/Changes.html

-Original Message-
From: Benson Margulies [mailto:bimargul...@gmail.com] 
Sent: Monday, March 05, 2012 11:11 AM
To: java-user@lucene.apache.org
Subject: Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

On Mon, Mar 5, 2012 at 11:07 AM, Steven A Rowe sar...@syr.edu wrote:
 The second item in the top section in trunk CHANGES.txt (back compat policy 
 changes):

Could you guys put this on the web site (or a link to it)? Or try to get it to 
SEO more prominently?


 * LUCENE-2858, LUCENE-3733: IndexReader was refactored into abstract
   AtomicReader, CompositeReader, and DirectoryReader. To open Directory-
   based indexes use DirectoryReader.open(), the corresponding method in
   IndexReader is now deprecated for easier migration. Only DirectoryReader
   supports commits, versions, and reopening with openIfChanged(). Terms,
   postings, docvalues, and norms can from now on only be retrieved using
   AtomicReader; DirectoryReader and MultiReader extend CompositeReader,
   only offering stored fields and access to the sub-readers (which may be
   composite or atomic). SlowCompositeReaderWrapper (LUCENE-2597) can be
   used to emulate atomic readers on top of composites.
   Please review MIGRATE.txt for information how to migrate old code.
   (Uwe Schindler, Robert Muir, Mike McCandless)

 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Monday, March 05, 2012 10:54 AM
 To: java-user@lucene.apache.org
 Subject: What replaces IndexReader.openIfChanged in Lucene 4.0?

 Sorry, I'm coming up empty in Google here.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
You want the lucene-queryparser jar.  From trunk MIGRATE.txt:

* LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser,
  where other QueryParsers from the codebase will also be placed.  The following classes were moved:
  - o.a.l.queryParser.CharStream -> o.a.l.queryparser.classic.CharStream
  - o.a.l.queryParser.FastCharStream -> o.a.l.queryparser.classic.FastCharStream
  - o.a.l.queryParser.MultiFieldQueryParser -> o.a.l.queryparser.classic.MultiFieldQueryParser
  - o.a.l.queryParser.ParseException -> o.a.l.queryparser.classic.ParseException
  - o.a.l.queryParser.QueryParser -> o.a.l.queryparser.classic.QueryParser
  - o.a.l.queryParser.QueryParserBase -> o.a.l.queryparser.classic.QueryParserBase
  - o.a.l.queryParser.QueryParserConstants -> o.a.l.queryparser.classic.QueryParserConstants
  - o.a.l.queryParser.QueryParserTokenManager -> o.a.l.queryparser.classic.QueryParserTokenManager
  - o.a.l.queryParser.Token -> o.a.l.queryparser.classic.Token
  - o.a.l.queryParser.TokenMgrError -> o.a.l.queryparser.classic.TokenMgrError
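
So after the move, parsing looks something like this (just a sketch; the field name and version constant are placeholders):

import org.apache.lucene.queryparser.classic.QueryParser;
...
QueryParser parser = new QueryParser(Version.LUCENE_40, "contents", analyzer);
Query query = parser.parse("some query");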


-Original Message-
From: Benson Margulies [mailto:bimargul...@gmail.com] 
Sent: Monday, March 05, 2012 11:15 AM
To: java-user@lucene.apache.org
Subject: Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

To reduce noise slightly I'll stay on this thread.

I'm looking at this file, and not seeing a pointer to what to do about 
QueryParser. Are jar file rearrangements supposed to be in that file?
I think that I don't have the right jar yet; all I'm seeing is the 'surround' 
package.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Customizing indexing of large files

2012-02-27 Thread Steven A Rowe
PatternReplaceCharFilter would probably work, or maybe a custom CharFilter?  
*CharFilter has the advantage of preserving original text offsets, for 
highlighting.
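
A rough, untested sketch of the PatternReplaceCharFilter route, assuming an upgrade to 3.x with the Solr jar on the classpath for org.apache.solr.analysis.PatternReplaceCharFilter and its (Pattern, String, CharStream) constructor; the version constant is a placeholder:

final Pattern dataBlock = Pattern.compile("DATA_BEGIN.*?DATA_END", Pattern.DOTALL);
Analyzer analyzer = new Analyzer() {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // blank out everything between DATA_BEGIN and DATA_END before tokenizing,
    // while keeping the original character offsets for highlighting
    CharStream filtered = new PatternReplaceCharFilter(dataBlock, "", CharReader.get(reader));
    return new StandardTokenizer(Version.LUCENE_35, filtered);
  }
};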

Steve

 -Original Message-
 From: Glen Newton [mailto:glen.new...@gmail.com]
 Sent: Monday, February 27, 2012 12:57 PM
 To: java-user@lucene.apache.org
 Subject: Re: Customizing indexing of large files
 
 Hi,
 
 Understood.
 Write a custom FileReader that filters out the text you do not want.
 This will do it streaming.
 
 Glen
 
 On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
 praka...@altair.com wrote:
  Hi,
 
   Description is multiline; in addition there is other text also. So,
  essentially what I need is to jump to DATA_END as soon as I hit
  DATA_BEGIN.
 
  I am creating the field using the constructor Field(String name, Reader
 reader) and using StandardAnalyser. Right now I am using FileReader which
 is causing all the text to be indexed/tokenized.
 
   The amount of text I am interested in is also pretty large; the description is
  just one such example. So, I really want some stream-based implementation
  to avoid keeping a large amount of text in memory. Maybe a custom
  TokenStream, but I don't know what to implement in a TokenStream. The only
  abstract method is incrementToken; I have no idea what to do in it.
 
  Regards,
 
  Prakash Bande
  Director - Hyperworks Enterprise Software
  Altair Eng. Inc.
  Troy MI
  Ph: 248-614-2400 ext 489
  Cell: 248-404-0292
 
  -Original Message-
  From: Glen Newton [mailto:glen.new...@gmail.com]
  Sent: Monday, February 27, 2012 12:05 PM
  To: java-user@lucene.apache.org
  Subject: Re: Customizing indexing of large files
 
   I'd suggest writing a perl script or
   <insert-favourite-scripting-language-here> script to pre-filter this
   content out of the files before it gets to Lucene/Solr.
   Or you could just grep for 'Data' and 'Description' (or is
   'Description' multi-line)?
 
  -Glen Newton
 
  On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
  praka...@altair.com wrote:
  Hi,
 
  I want to customize the indexing of some specific kind of files I have.
 I am using 2.9.3 but upgrading is possible.
  This is how my file's data looks
 
  *
  Data for 2010
  Description: This section has a general description of the data.
  DATA_BEGIN
  Month       P1          P2          P3
  01          3243.433    43534.324   45345.2443
  02          3242.324    234234.24   323.2343
  ...
  ...
  ...
  ...
  DATA_END
  Data for 2011
  Description: This section has a general description of the data.
  DATA_BEGIN
  Month       P1          P2          P3
  01          3243.433    43534.324   45345.2443
  02          3242.324    234234.24   323.2343
  ...
  ...
  ...
  ...
  DATA_END
  *
 
  I would like to use a StandardAnalyser, but do not want to index the
 data of the columns, i.e. skip all those numbers. Basically, as soon as I
 hit the keyword DATA_BEGIN, I want to jump to DATA_END.
  So, what is the best approach? Using a custom Reader, custom tokenizer
 or some other mechanism.
  Regards,
 
  Prakash Bande
  Altair Eng. Inc.
  Troy MI
  Ph: 248-614-2400 ext 489
  Cell: 248-404-0292
 
 
 
 
  --
  -
  http://zzzoot.blogspot.com/
  -
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 
 --
 -
 http://zzzoot.blogspot.com/
 -
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: StandardAnalyzer and Email Addresses

2012-02-26 Thread Steven A Rowe
There is no Analyzer implementation because no one ever made one :).  
Copy-pasting StandardAnalyzer and substituting UAX29URLEmailTokenizer wherever 
StandardTokenizer appears should do the trick.

Because people often want to be able to search against *both* whole email 
addresses and URLs *and* their components, a UAX29URLEmailAnalyzer would 
ideally have filter(s) to emit email/URL components at the same position as the 
full term.  Or rather, the reverse: each component would have its own position, 
and the full term would be positioned at the head component's position.
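
A bare-bones (untested) sketch of the copy-paste approach, without the component-emitting filters described above (the version constant is a placeholder):

Analyzer emailAwareAnalyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // same chain as StandardAnalyzer, but with the URL/email-aware tokenizer
    Tokenizer source = new UAX29URLEmailTokenizer(Version.LUCENE_36, reader);
    TokenStream result = new StandardFilter(Version.LUCENE_36, source);
    result = new LowerCaseFilter(Version.LUCENE_36, result);
    result = new StopFilter(Version.LUCENE_36, result, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, result);
  }
};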

Steve

 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Sunday, February 26, 2012 3:51 AM
 To: java-user@lucene.apache.org
 Subject: RE: StandardAnalyzer and Email Addresses
 
 Hi,
 
 If you want a Tokenizer for your Analyzer that supports eMail detection,
 use
 UAX29URLEmailTokenizer (see http://goo.gl/evH97). There is no Analyzer
 available that uses this Tokenizer, but you can define your own one like
 StandardAnalyzer, but with this class as Tokenizer (not
 StandardTokenizer).
 I am not sure why there is no Analyzer implementation already available,
 maybe Steven Rowe knows more.
 
 The trick with the phrase is of lower performance as it uses a PhraseQuery
 internally, which is more expensive.
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Charlie Hubbard [mailto:charlie.hubb...@gmail.com]
  Sent: Sunday, February 26, 2012 1:51 AM
  To: java-user@lucene.apache.org
  Subject: Re: StandardAnalyzer and Email Addresses
 
   I am using StandardAnalyzer in 3.1.  I'd been previously using 2.4, and
   from that documentation it states email addresses are recognized:
 
  http://javasourcecode.org/html/open-source/lucene/lucene-
  2.4.0/org/apache/lucene/analysis/standard/StandardTokenizer.html
 
  It looks like this was changed in 3.x according to this doc now:
 
 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/or
 g/
  apache/lucene/analysis/standard/ClassicTokenizer.html
 
   I think I've found a workaround in that if I search for an email address
   like:
  
   to:"charlie.hubb...@gmail.com"
 
  Then it will look for the full email address.  What is the draw back of
 using the
  quoted version?  Is the performance worse doing this?  How much worse?
 I'm
  not sure how quoted searches are implemented so it's hard for me to
 gauge
  what the draw back is.
 
  Thanks
  Charlie
 
  On Mon, Feb 20, 2012 at 12:23 PM, Ian Lea ian@gmail.com wrote:
 
   Are you using StandardAnalyzer in 3.1+?  You may want to use
   ClassicAnalyzer instead.  I can't see where in the 3.5 javadocs it
   says that email addresses are recognized, but it does sound vaguely
   familiar.
  
  
   --
   Ian.
  
  
   On Thu, Feb 16, 2012 at 5:18 PM, Charlie Hubbard
   charlie.hubb...@gmail.com wrote:
This is a pretty simple question to answer, but I have customers
asking
   me
how this is suppose to work and I'm having trouble explaining it.  I
have an app that indexes emails so there are plenty of email
addresses in
   there.
 Reading the StandardAnalyzer javadoc it says it recognizes email
addresses when it is creating the token list.  What tokens will it
   produce
exactly?  What I'm seeing when I perform searches is the email
address looks like its being tokenized into its parts.  Searching by
an email address like:
   
to:charlie.hubb...@gmail.com
   
pulls back more hits that haven't been addressed to
charlie.hubb...@gmail.com.  Other messages with gmail.com in them
are returned.  If I use the following:
   
to:charlie.hubbard
   
in them.  It also finds gmail.com, and other domains.  And I can
search
   for
strings like
   
to:charlie.hubb...@gmail.com
   
it will pull back only emails addressed to that address.  Further
proof
   it
seems to token the parts of an email is if I search for a very
specific email address like:
   
to:charlie.hubbard+sometag
   
That will pull back only emails addressed to that email, but it's
not a full email address.  Which leads me to think it will parse
parts of the email addresses.  Can someone explain this a little
 more?
   
I'm having trouble with some emails that can't be pulled back using
the username like searching for to:chubbard where the email was
addressed to chubb...@somedomain.com, but it fails to show up in the
  search results.
I
can't explain why that's happening.  In all of my tests I can't
reproduce it and I think I might have to reindex everything because
this was an
   index
built with 2.4 and I upgraded to 3.1 so I'm worried it might be
   corrupted.
   
Thoughts?
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

RE: Can I just add ShingleFilter to my nalayzer used for indexing and searching

2012-02-21 Thread Steven A Rowe
Hi Paul,

Lucene QueryParser splits on whitespace and then sends individual words 
one-by-one to be analyzed.  All analysis components that do their work based on 
more than one word, including ShingleFilter and SynonymFilter, are borked by 
this.  (There is a JIRA issue open for the QueryParser problem: 
https://issues.apache.org/jira/browse/LUCENE-2605).  

There is a workaround involving PositionFilter described on the Solr wiki: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory.
  Essentially, include PositionFilter after ShingleFilter in your analyzer, 
then wrap queries in quotes before sending them to QueryParser.
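
A minimal (untested) sketch of that workaround on the analysis side; the version constant and shingle sizes are placeholders:

Analyzer analyzer = new Analyzer() {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
    stream = new LowerCaseFilter(Version.LUCENE_35, stream);
    stream = new ShingleFilter(stream, 2, 2);
    // flatten positions so a quoted query string analyzes into overlapping shingles
    stream = new PositionFilter(stream);
    return stream;
  }
};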

CommonGramsFilter does the emit-only-shingles-containing-stopwords thing, but 
in Lucene/Solr 3.x, it's in Solr (solr-core-3.X.jar, to be exact), not Lucene; 
you can use it in your application by including the solr-core jar as a 
dependency.  In trunk, which will be released as Lucene/Solr 4.0, 
CommonGramsFilter has been moved to the analyzers-common module.

Steve

 -Original Message-
 From: Paul Taylor [mailto:paul_t...@fastmail.fm]
 Sent: Tuesday, February 21, 2012 8:07 AM
 To: java-user@lucene.apache.org
 Subject: Can I just add ShingleFilter to my nalayzer used for indexing and
 searching
 
  Trying out ShingleFilter, and the way it is documented implies that you
  can just add it to your analyzer and that's it, with no side-effects
  except a larger index, but I read others implying you have to modify the
  way you parse user queries; could anyone confirm/deny?
  
  Also, is there an easy way to use a ShingleFilter only for common stop
  words, or is that pointless?
 
 Paul
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Maven repository for lucene trunk

2012-02-14 Thread Steven A Rowe
Hi Sudarshan,

I think this wiki page has the info you want:

http://wiki.apache.org/lucene-java/HowNightlyBuildsAreMade

Steve

 -Original Message-
 From: sudarsh...@gmail.com [mailto:sudarsh...@gmail.com] On Behalf Of
 Sudarshan Gaikaiwari
 Sent: Tuesday, February 14, 2012 10:01 PM
 To: java-user@lucene.apache.org
 Subject: Maven repository for lucene trunk
 
 HI
 
 I would like to add dependencies on the lucene trunk in my maven project.
 
 Maven central does not seem to have the trunk artifacts.
 http://search.maven.org/#search%7Cga%7C1%7Clucene
 
 Is there a maven repository with the lucene trunk jars. I would prefer to
 add a dependency on such a repository instead of adding these jars
 locally.
 
 Thanks
 
 
 --
 Sudarshan Gaikaiwari
 www.sudarshan.org
 sudars...@acm.org


RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Hi Damerian,

One way to handle your scenario is to hold on to the previous token, and only 
emit a token after you reach at least the second token (or at end-of-stream).  
Your incrementToken() method could look something like:

1. Get current attributes: input.incrementToken()
2. If previous token does not exist:
  2a. Store current attributes as previous token (see 
AttributeSource#cloneAttributes)
2b. Get current attributes: input.incrementToken()
3. Check for  store conditions that will affect previous token's attributes
4. Store current attributes as next token (see AttributeSource#cloneAttributes)
5. Copy previous token into current attributes (see AttributeSource#copyTo);
   the target will be this, which is an AttributeSource.
6. Make changes based on conditions found in step #3 above
7. set previous token = next token
8. return true

(Everywhere I say token I mean instance of AttributeSource.)

The final token in the input stream will need special handling, as will 
single-token input streams.
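
Here is a rough, untested sketch of those steps, using captureState()/restoreState() rather than cloneAttributes()/copyTo(); the class name, the condition on the next token, and the position-increment tweak are placeholders for whatever you actually need:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class LookAheadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State previous;  // buffered previous token

  public LookAheadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String currentTerm = termAtt.toString();
      State current = captureState();
      if (previous == null) {       // first token: just buffer it and read on
        previous = current;
        continue;
      }
      // condition on the *next* token that affects the buffered one (placeholder)
      boolean nextStartsName =
          currentTerm.length() > 0 && Character.isUpperCase(currentTerm.charAt(0));
      restoreState(previous);       // emit the buffered (previous) token...
      if (nextStartsName) {
        posIncAtt.setPositionIncrement(0);  // ...adjusted as needed (placeholder)
      }
      previous = current;           // the current token becomes the new buffer
      return true;
    }
    if (previous != null) {         // end of stream: flush the last buffered token
      restoreState(previous);
      previous = null;
      return true;
    }
    return false;
  }
}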

Good luck,
Steve

 -Original Message-
 From: Damerian [mailto:dameria...@gmail.com]
 Sent: Thursday, February 09, 2012 2:19 PM
 To: java-user@lucene.apache.org
 Subject: Access next token in a stream
 
 Hello i want to implement my custom filter, my wuestion is quite simple
 but i cannot find a solution to it no matter how i try:
 
 How can i access the TermAttribute of the next token after the one i
 currently have in my stream?
 
 For example, in the phrase "My name is James Bond", if let's say i am in
 the token [My], i would like to be able to check the TermAttribute of
 the following token [name] and fix my position increment accordingly.
 
 Thank you in advance!
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Damerian,

The technique I mentioned would work for you with a little tweaking: when you 
see consecutive capitalized tokens, then just set the CharTermAttribute to the 
joined tokens, and clear the previous token.

Another idea: you could use ShingleFilter with min size = max size = 2, and 
then use a following Filter extending FilteringTokenFilter, with an accept() 
method that examines shingles and rejects ones that don't qualify, something 
like the following.  (Notes: this is untested; I assume you will use the 
default shingle token separator " "; and this filter will reject all 
non-shingle terms, so you won't get anything but names, even if you configure 
ShingleFilter to emit single tokens):

public final class MyNameFilter extends FilteringTokenFilter {
  private static final Pattern NAME_PATTERN
      = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public MyNameFilter(boolean enablePositionIncrements, TokenStream input) {
    super(enablePositionIncrements, input);
  }

  @Override public boolean accept() throws IOException {
    return NAME_PATTERN.matcher(termAtt).matches();
  }
}
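
Wiring it up (again untested; the constructor mirrors FilteringTokenFilter's) might look like:

  TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
  stream = new ShingleFilter(stream, 2, 2);   // min size = max size = 2
  stream = new MyNameFilter(true, stream);    // keep only the "Name Surname" shingles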

Steve

 -Original Message-
 From: Damerian [mailto:dameria...@gmail.com]
 Sent: Thursday, February 09, 2012 4:15 PM
 To: java-user@lucene.apache.org
 Subject: Re: Access next token in a stream
 
 On 9/2/2012 8:54 PM, Steven A Rowe wrote:
  Hi Damerian,
 
  One way to handle your scenario is to hold on to the previous token, and
 only emit a token after you reach at least the second token (or at end-of-
 stream).  Your incrementToken() method could look something like:
 
  1. Get current attributes: input.incrementToken()
  2. If previous token does not exist:
 2a. Store current attributes as previous token (see
 AttributeSource#cloneAttributes)
  2b. Get current attributes: input.incrementToken()
  3. Check for  store conditions that will affect previous token's
 attributes
  4. Store current attributes as next token (see
 AttributeSource#cloneAttributes)
  5. Copy previous token into current attributes (see
 AttributeSource#copyTo);
  the target will be this, which is an AttributeSource.
  6. Make changes based on conditions found in step #3 above
  7. set previous token = next token
  8. return true
 
  (Everywhere I say token I mean instance of AttributeSource.)
 
  The final token in the input stream will need special handling, as will
 single-token input streams.
 
  Good luck,
  Steve
 
  -Original Message-
  From: Damerian [mailto:dameria...@gmail.com]
  Sent: Thursday, February 09, 2012 2:19 PM
  To: java-user@lucene.apache.org
  Subject: Access next token in a stream
 
  Hello i want to implement my custom filter, my wuestion is quite simple
  but i cannot find a solution to it no matter how i try:
 
  How can i access the TermAttribute of the  next token than the one i
  currently have in my stream?
 
  For example in  the phrase My name is James Bond if let's say i am in
  the token [My], i would like to be able to check the TermAttribute of
  the following token [name] and fix my position increment accordingly.
 
  Thank you in advance!
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 Hi Steve,
 Thank you for your immediate reply. i will try your solution but i feel
 that it does not solve my case.
 What i am trying to make is a filter that joins together two
 terms/tokens that start with a capital letter (it is trying to find all
 the Names/Surnames and make them one token)  so in my aforementioned
 example when i examine [James] even if i store the TermAttribute to a
 temporary token, how can i check the next one [Bond], to join them
 without actually emitting (and therefore creating a term in my inverted
 index) that has [James] on its own.
 Thank you again for your insight and i would really appreciate any other
 views on the matter.
 
 Regards, Damerian
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Damerian,

When I said clear the previous token, I was referring to the pseudo-code I 
gave in my first response to you.  There is no built-in method to do that.  If 
you want to conditionally output tokens, you should store AttributeSource 
clones, as in my pseudo-code.

Steve

 -Original Message-
 From: Damerian [mailto:dameria...@gmail.com]
 Sent: Thursday, February 09, 2012 5:00 PM
 To: java-user@lucene.apache.org
 Subject: Re: Access next token in a stream
 
 On 9/2/2012 10:51 PM, Steven A Rowe wrote:
  Damerian,
 
  The technique I mentioned would work for you with a little tweaking:
 when you see consecutive capitalized tokens, then just set the
 CharTermAttribute to the joined tokens, and clear the previous token.
 
  Another idea: you could use ShingleFilter with min size = max size = 2,
 and then use a following Filter extending FilteringTokenFilter, with an
 accept() method that examines shingles and rejects ones that don't
 qualify, something like the following.  (Notes: this is untested; I assume
 you will use the default shingle token separator  ; and this filter will
 reject all non-shingle terms, so you won't get anything but names, even if
 you configure ShingleFilter to emit single tokens):
 
  public final class MyNameFilter extends FilteringTokenFilter {
 private static final Pattern NAME_PATTERN
 = Pattern.compile(\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+);
 private final CharTermAttribute termAtt =
 addAttribute(CharTermAttribute.class);
 @Override public boolean accept() throws IOException {
   return NAME_PATTERN.matcher(termAtt).matches();
 }
  }
 
  Steve
 
  -Original Message-
  From: Damerian [mailto:dameria...@gmail.com]
  Sent: Thursday, February 09, 2012 4:15 PM
  To: java-user@lucene.apache.org
  Subject: Re: Access next token in a stream
 
  On 9/2/2012 8:54 PM, Steven A Rowe wrote:
  Hi Damerian,
 
  One way to handle your scenario is to hold on to the previous token,
 and
  only emit a token after you reach at least the second token (or at end-
 of-
  stream).  Your incrementToken() method could look something like:
  1. Get current attributes: input.incrementToken()
  2. If previous token does not exist:
  2a. Store current attributes as previous token (see
  AttributeSource#cloneAttributes)
2b. Get current attributes: input.incrementToken()
  3. Check for   store conditions that will affect previous token's
  attributes
  4. Store current attributes as next token (see
  AttributeSource#cloneAttributes)
  5. Copy previous token into current attributes (see
  AttributeSource#copyTo);
   the target will be this, which is an AttributeSource.
  6. Make changes based on conditions found in step #3 above
  7. set previous token = next token
  8. return true
 
  (Everywhere I say token I mean instance of AttributeSource.)
 
  The final token in the input stream will need special handling, as
 will
  single-token input streams.
  Good luck,
  Steve
 
  -Original Message-
  From: Damerian [mailto:dameria...@gmail.com]
  Sent: Thursday, February 09, 2012 2:19 PM
  To: java-user@lucene.apache.org
  Subject: Access next token in a stream
 
  Hello i want to implement my custom filter, my wuestion is quite
 simple
  but i cannot find a solution to it no matter how i try:
 
  How can i access the TermAttribute of the  next token than the one i
  currently have in my stream?
 
  For example in  the phrase My name is James Bond if let's say i am
 in
  the token [My], i would like to be able to check the TermAttribute of
  the following token [name] and fix my position increment accordingly.
 
  Thank you in advance!
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
  Hi Steve,
  Thank you for your immediate reply. i will try your solution but i feel
  that it does not solve my case.
  What i am trying to make is a filter that joins together two
  terms/tokens that start with a capital letter (it is trying to find all
  the Names/Surnames and make them one token)  so in my aforementioned
  example when i examine [James] even if i store the TermAttribute to a
  temporary token how can i check the next one [Bond] , to join them
  without actually emmiting (and therefore creating a term in my inverted
  index) that has [James] on its own.
  Thank you again for your insight and i would relly appreciate any other
  views on the matter.
 
  Regards, Damerian
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 I think my solution is almost complete now; only one question: you mentioned
 "clear the previous token". Is there a built-in method for doing that?
 In the beginning i thought that if i put my new token

RE: Analysers for newspaper pages...

2011-11-28 Thread Steven A Rowe
Hi Dawn,

I assume that when you refer to the impact of stop words, you're concerned 
about query-time performance?  You should consider the possibility that 
performance without removing stop words is good enough that you won't have to 
take any steps to address the issue.

That said, there are two filters in Solr 3.X[1] that would do the equivalent of 
what you have outlined: CommonGramsFilter 
http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html
 and CommonGramsQueryFilter 
http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html.

You can use these filters with a Lucene 3.X application by including the 
(same-versioned) solr-core jar as a dependency.

Steve

[1] In Lucene/Solr trunk, which will be released as 4.0, these filters have 
been moved to a shared Lucene/Solr module.

 -Original Message-
 From: Dawn Zoë Raison [mailto:d...@digitorial.co.uk]
 Sent: Monday, November 28, 2011 2:10 PM
 To: java-user@lucene.apache.org
 Subject: Analysers for newspaper pages...
 
 Hi folks,
 
 I'm researching the best options to use for analysing/storing newspaper
 pages in our online archive, and wondered if anyone has any good hints
 or tips on good practice for this type of media?
 
 I'm currently thinking along the lines of using a customised
 StandardAnalyser (no stop words + extra date token detection) wrapped
 with a Shingle filter and finally a Stopword filter - the thinking being
 this should reduce the impact of stop words but still allow "to be or
 not to be" searches...
 
 A future aim is to add a synonym filter at search time.
 
 We currently have ~2.5million pages - some of the older broadsheet pages
 can have a serious number of tokens.
 We currently index using the SimpleAnalyser - a hangover from the
 previous developers I hope to remedy :-).
 
 --
 
 Rgds.
 *Dawn Raison*
 Technical Director, Digitorial Ltd.
 



RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul,

On 10/19/2011 at 5:26 AM, Paul Taylor wrote:
 On 18/10/2011 15:25, Steven A Rowe wrote:
  On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
   On 18/10/2011 06:19, Steven A Rowe wrote:
Another option is to create a char filter that substitutes
PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
etc.,
  
   Yes that is how I first did it
 
  No, I don't think you did.  When I say char filter I'm referring to
  CharFilter [snip]

 If you look at the code you can see I do use a CharFilter: [snip]

I apologize, you're obviously right, I hadn't looked at your code. 

  If you go with a CharFilter, you can give it access to the entire input
  at once, and use a regular expression (or something like it) to assess
  the input and then behave accordingly.

 Well this is the problem, you can't use a regular expression, or even if
 you did, wouldn't that really slow things down, seeing as 99%
 don't need the transformation?

PatternReplaceCharFilter might do the trick - maybe worth a test to see if it's 
performant enough?

Steve


RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul,

What version of Lucene are you using?  The JFlex spec you quote below looks 
pre-v3.1?

Steve

 -Original Message-
 From: Paul Taylor [mailto:paul_t...@fastmail.fm]
 Sent: Wednesday, October 19, 2011 6:50 AM
 To: Steven A Rowe; java-user@lucene.apache.org  'java-
 u...@lucene.apache.org'
 Subject: Re: How do you see if a tokenstream has tokens without consuming
 the tokens ?
 
 On 18/10/2011 05:19, Steven A Rowe wrote:
  Hi Paul,
 
  You could add a rule to the StandardTokenizer JFlex grammar to handle
 this case, bypassing its other rules.
 This seemed to be working; just to test it out I changed the EMAIL one
 to this
 
 EMAIL =  (!|*|^|!|.|@|%|♠|\)+
 
 And changed the order the tokens were checked
 
 %%
 
 {ALPHANUM} { return
 ALPHANUM; }
 {APOSTROPHE}   { return
 APOSTROPHE; }
 {ACRONYM}  { return
 ACRONYM; }
 {COMPANY}  { return
 COMPANY; }
 {HOST} { return
 HOST; }
 {NUM}  { return
 NUM; }
 {CJ}   { return
 CJ; }
 {ACRONYM_DEP}  { return
 ACRONYM_DEP; }
 {EMAIL}{ return
 EMAIL; }
 
 /** Ignore the rest */
 . | {WHITESPACE}   { /*
 ignore */ }
 
 
 So then if I passed '!!!' to the tokenizer, it kept it, which was exactly
 what I wanted
 
 However if I passed it 'fred!!!' it  split it into two tokens
 
 'fred' and '!!!'
 
 which is not what I wanted, I just wanted to get back
 
 fred
 
 
 I tried chnaging EMAIL to
 
 EMAIL =  ^(!|*|^|!|.|@|%|♠|\)+
 
 but use of ^ and $ seems to be disallowed, so I can't see if there is
 any way to do what I want in the jflex; if that's the case can I drop the
 2nd filter somehow in a subsequent filter ?
 
 
 Paul
 
 
 
 
 
 
 
  Another option is to create a char filter that substitutes PUNCT-
 EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., but
 only when the entire input consists exclusively of whitespace and
 punctuation.  These symbols would then be left intact by
 StandardTokenizer.
 
  Steve
 
  -Original Message-
  From: Paul Taylor [mailto:paul_t...@fastmail.fm]
  Sent: Monday, October 17, 2011 8:13 AM
  To: 'java-user@lucene.apache.org'
  Subject: How do you see if a tokenstream has tokens without consuming
 the
  tokens ?
 
 
  We have a modified version of a Lucene StandardAnalyzer , we use it
 for
  tokenizing music metadata such as as artist names  song titles, so
  typically only a few words. On tokenizing it usually it strips out
  punctuations which is correct, however if the input text consists of
  only punctuation characters then we end up with nothing, for these
  particular RARE cases I want to use a mapping filter.
 
  So what I try to do is have my analyzer tokenize as normal, then if
 the
  results is no tokens retokenize with the mapping filter , I check it
 has
  no token using incrementToken() but then cant see how I
  decrementToken(). How can I do this, or is there a more efficient way
 of
  doing this. Note of maybe 10,000,000 records only a few 100 records
 will
  have this problem so I need a solution which doesn't impact
 performance
  unreasonably.
 
NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
    specialcharConvertMap.add("!", "Exclamation");
    specialcharConvertMap.add("?", "QuestionMark");
...
 
public  TokenStream tokenStream(String fieldName, Reader reader)
 {
CharFilter specialCharFilter = new
  MappingCharFilter(specialcharConvertMap,reader);
 
StandardTokenizer tokenStream = new
  StandardTokenizer(LuceneVersion.LUCENE_VERSION);
try
{
if(tokenStream.incrementToken()==false)
{
tokenStream = new
  StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
}
else
{
//TODO  set tokenstream back as it
 was
  before increment token
}
}
catch(IOException ioe)
{
 
}
TokenStream result = new LowercaseFilter(result);
return result;
}
 
  thanks for any help
 
 
  Paul
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-18 Thread Steven A Rowe
Hi Paul,

On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
 On 18/10/2011 06:19, Steven A Rowe wrote:
  Another option is to create a char filter that substitutes
  PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
  etc.,
 
 Yes that is how I first did it

No, I don't think you did.  When I say char filter I'm referring to 
CharFilter 
http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html
 - this is a different kind of thing from the token filter approach you 
described taking previously.

Lucene Analyzers may be composed of three different kinds of components: 

* CharFilter: character-level filter; precedes the tokenizer; allows for 
character stream modifications while enabling original character offsets to be 
maintained (to enable e.g. highlighting).  Input: character stream; output: 
character stream.  An analyzer may contain zero or more of these.

* Tokenizer: identifies character sequences that will serve as (the basis of) 
indexable tokens.  Input: character stream; output: token stream. An analyzer 
must contain exactly one of these.

* TokenFilter: token-level filter; follows the Tokenizer; transforms, adds 
and/or removes tokens to/from the token stream.  Input: token stream; output: 
token stream.  An analyzer may contain zero or more of these.

  but only when the entire input consists exclusively of whitespace and
  punctuation.
 
 but I couldn't work out how to only do it when exclusively whitespace and
 punctuation, any ideas to solve that?

If you go with a CharFilter, you can give it access to the entire input at 
once, and use a regular expression (or something like it) to assess the input 
and then behave accordingly.
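
As a rough, untested sketch (the readFully() helper is hypothetical, and \p{Punct} only covers ASCII punctuation, so adjust as needed), your tokenStream() method could decide up front which path to take:

public TokenStream tokenStream(String fieldName, Reader reader) {
  String text;
  try {
    text = readFully(reader);                    // hypothetical helper: read the whole input once
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  Reader source = new StringReader(text);
  if (text.matches("[\\p{Punct}\\s]+")) {        // input is only punctuation/whitespace
    source = new MappingCharFilter(specialcharConvertMap, source);
  }
  TokenStream result = new StandardTokenizer(Version.LUCENE_34, source);
  return new LowerCaseFilter(Version.LUCENE_34, result);
}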

Steve



RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Steven A Rowe
Hi Paul,

You could add a rule to the StandardTokenizer JFlex grammar to handle this 
case, bypassing its other rules.

Another option is to create a char filter that substitutes PUNCT-EXCLAMATION 
for exclamation points, PUNCT-PERIOD for periods, etc., but only when the 
entire input consists exclusively of whitespace and punctuation.  These symbols 
would then be left intact by StandardTokenizer.

Steve

 -Original Message-
 From: Paul Taylor [mailto:paul_t...@fastmail.fm]
 Sent: Monday, October 17, 2011 8:13 AM
 To: 'java-user@lucene.apache.org'
 Subject: How do you see if a tokenstream has tokens without consuming the
 tokens ?
 
 
 We have a modified version of a Lucene StandardAnalyzer; we use it for
 tokenizing music metadata such as artist names & song titles, so
 typically only a few words. On tokenizing, it usually strips out
 punctuation, which is correct; however if the input text consists of
 only punctuation characters then we end up with nothing, and for these
 particular RARE cases I want to use a mapping filter.
 
 So what I try to do is have my analyzer tokenize as normal, then if the
 results is no tokens retokenize with the mapping filter , I check it has
 no token using incrementToken() but then cant see how I
 decrementToken(). How can I do this, or is there a more efficient way of
 doing this. Note of maybe 10,000,000 records only a few 100 records will
 have this problem so I need a solution which doesn't impact performance
 unreasonably.
 
  NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
   specialcharConvertMap.add("!", "Exclamation");
   specialcharConvertMap.add("?", "QuestionMark");
  ...
 
  public  TokenStream tokenStream(String fieldName, Reader reader) {
  CharFilter specialCharFilter = new
 MappingCharFilter(specialcharConvertMap,reader);
 
  StandardTokenizer tokenStream = new
 StandardTokenizer(LuceneVersion.LUCENE_VERSION);
  try
  {
  if(tokenStream.incrementToken()==false)
  {
  tokenStream = new
 StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
  }
  else
  {
  //TODO  set tokenstream back as it was
 before increment token
  }
  }
  catch(IOException ioe)
  {
 
  }
  TokenStream result = new LowercaseFilter(result);
  return result;
  }
 
 thanks for any help
 
 
 Paul
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: setting MaxFieldLength in indexwriter

2011-09-28 Thread Steven A Rowe
Hi Peyman,

The API docs give a hint 
http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/index/IndexWriter.html:

=
Nested Class Summary
...
static class IndexWriter.MaxFieldLength
Deprecated. use LimitTokenCountAnalyzer instead.
=

http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/LimitTokenCountAnalyzer.html
 

Also, if you're composing your own Analysis pipeline, you'll likely be 
interested in the Filter variant of the above-linked Analyzer wrapper:

http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/LimitTokenCountFilter.html
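
For example (untested), based on the code in your message:

File file = new File(stopWordsFile);
Directory dir = NIOFSDirectory.open(new File(indexDir));
Analyzer limited = new LimitTokenCountAnalyzer(
    new StandardAnalyzer(Version.LUCENE_32, file), Integer.MAX_VALUE);
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_32, limited);
IndexWriter writer = new IndexWriter(dir, conf);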
 

Steve

 -Original Message-
 From: Peyman Faratin [mailto:pey...@robustlinks.com]
 Sent: Wednesday, September 28, 2011 9:08 AM
 To: java-user@lucene.apache.org
 Subject: setting MaxFieldLength in indexwriter
 
 Hi
 
 Newbie question. I'm trying to set the max field length property of the
 indexwriter to unlimited. The old api is now deprecated but I can't seem
 to be able to figure out how to set the field with the new
 (IndexWriterConfig) API. I've tried
 IndexWriterConfig.maxFieldLength(Integer.MAX_VALUE)  but to no avail. Any
 help would be much appreciated as always
 
 
 
 
   File file = new File(stopWordsFile);
   Directory dir = NIOFSDirectory.open(new File(indexDir));
   IndexWriterConfig conf = new
 IndexWriterConfig(Version.LUCENE_32,
   new StandardAnalyzer(Version.LUCENE_32,file));
 
   conf.maxFieldLength(Integer.MAX_VALUE) ;
 
   writer = new IndexWriter(dir, conf);
 
 thank you
 



RE: Enabling indexing of hyphenated terms sans the hyphen

2011-09-19 Thread Steven A Rowe
Hi sbs,

Solr's WordDelimiterFilterFactory does what you want.  You can see a 
description of its function here: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory.

WordDelimiterFilter, the filter class implementing the above factory's 
functionality, is package private in Solr 3.X, so unless you want to circumvent 
this access restriction (e.g. with introspection or a with façade class in the 
same package as the Solr filter class), you can't just depend on the v3.2 
solr-core jar, where it resides. In trunk (4.0, not yet released), 
WordDelimiterFilter has been moved to the analysis-common module and made 
public.

You can copy/paste WordDelimiterFilter.java into your project and use it 
without any additional dependencies beyond lucene-core.  Here's the source for 
the Lucene/Solr 3.2 version: 
http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_2/solr/src/java/org/apache/solr/analysis/WordDelimiterFilter.java.

Good luck,
Steve

 -Original Message-
 From: SBS [mailto:jturn...@uow.edu.au]
 Sent: Monday, September 19, 2011 4:27 PM
 To: java-user@lucene.apache.org
 Subject: Enabling indexing of hyphenated terms sans the hyphen
 
 We use StandardTokenizer and this works well but we also need to include
 terms in our index which consist of hyphenated terms with the hyphen
 removed.  So, for example, if the text being indexed contains self-
 induced
 we need the terms self, induced and selfinduced to be indexed.
 
 How would I go about implementing this?  We use Lucene Java 3.2.
 
 Thanks,
 
 -sbs
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Enabling-indexing-of-hyphenated-terms-
 sans-the-hyphen-tp3350008p3350008.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: 4.0-SNAPSHOT in maven repo via Jenkins?

2011-07-25 Thread Steven A Rowe
Hi Eric,

On 7/24/2011 at 3:07 AM, Eric Charles wrote:
 Jenkins jobs builds lucene trunk with 'mvn --batch-mode
 --non-recursive -Pbootstrap install' [1]

Two things: a) This Jenkins job builds both Lucene and Solr; and b) the above 
non-recursive invocation with the 'bootstrap' profile is used to install the 
non-mavenized dependencies to the local repository -- the other two following 
invocations actually perform the build:

 Installing non-mavenized deps into the maven local repo
.../mvn --batch-mode --non-recursive -Pbootstrap install
[...]
 Clearing the Ant build output
.../mvn --batch-mode --fail-at-end clean
[...]
 Running the Maven build without tests
.../mvn --batch-mode --fail-at-end -DskipTests install

 Would it be possible to also invoke 'mvn deploy' to have the
 4.0-SNAPSHOT artifacts deployed in apache snapshot repository [2]

There is an open JIRA issue to do this - while it's nominally a Solr issue, the 
fix would apply to both Lucene and Solr:

https://issues.apache.org/jira/browse/SOLR-2634

FYI, in addition to exercising the Maven build using 'mvn install' and 'mvn 
test', the nightly Jenkins Maven jobs also run 'ant generate-maven-artifacts', 
and then publish the results.  See this wiki page for details, including how to 
refer to these published snapshot artifacts from your POM:

http://wiki.apache.org/lucene-java/HowNightlyBuildsAreMade

Steve



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Some question about Lucene

2011-07-10 Thread Steven A Rowe
This slide show is a few years old, but I think it might be a good introduction 
for you to the differences between the projects:

http://www.slideshare.net/dnaber/apache-lucene-searching-the-web-and-everything-else-jazoon07/

Steve

-Original Message-
From: Ing. Yusniel Hidalgo Delgado [mailto:yhdelg...@uci.cu] 
Sent: Sunday, July 10, 2011 9:30 PM
To: java-user@lucene.apache.org
Subject: Some question about Lucene


Hello 

I'm a new Lucene user. I have the following question: is it possible to build a 
crawler/spider with the Lucene library, or is Lucene only for the index/search 
phases? I am studying three projects: Nutch, Lucene and Solr, but I don't see 
what the main difference between them is. 

Greetings . 
-- 

 
Eng. Yusniel Hidalgo Delgado 
University of Informatics Sciences 

 


RE: how are built the packages in the maven repository?

2011-07-06 Thread Steven A Rowe
Ant is the official Lucene/Solr build system.  Snapshot and release artifacts 
are produced with Ant.

While Maven is capable of producing artifacts, the artifacts produced in this 
way may not be the same as the official Ant artifacts.  For this reason: no, 
the artifacts should not be built with Maven.

Sorry, I know nothing about OSGI.

Steve

-Original Message-
From: je...@vige.it [mailto:je...@vige.it] 
Sent: Wednesday, July 06, 2011 6:51 AM
To: java-user@lucene.apache.org
Subject: how are built the packages in the maven repository?

Hi I'm looking inside the jenkins maven repository. For example the package 
in 
https://builds.apache.org/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/org/apache/lucene/lucene-misc/4.0-SNAPSHOT/lucene-misc-4.0-20110705.223250-1.jar
 seems to be built with ant instead of maven. It is visible from the MANIFEST.MF

Is it correct? Should it not be built with maven? Actually maven has OSGI 
support, unlike the ant version



 


--
Luca Stancapiano
javaee consultant 
website: www.vige.it
tel: 3381584484
skype: flashboss62


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene Simple Project

2011-06-18 Thread Steven A Rowe
Hi Hamada,

Do you know about the Lucene demo?:

http://lucene.apache.org/java/3_2_0/demo.html

Steve

 -Original Message-
 From: hamadazahera [mailto:hamadazah...@gmail.com]
 Sent: Saturday, June 18, 2011 9:30 AM
 To: java-user@lucene.apache.org
 Subject: Lucene Simple Project
 
 Hello AL ,
 
 I'm a beginner to the Lucene api. I have tried many times to implement a
 simple search which specifies the data directory and the index directory,
 and then specifies the query to search in the contents,
 
 but while I'm displaying the results in the search function I can't
 display the name of the file ..
 
 
 please, does anyone have a simple search Lucene application which I can
 understand, to help solve my problem?
 
 send me @ hamada.zah...@yahoo.com
 
 --
 View this message in context: http://lucene.472066.n3.nabble.com/Lucene-
 Simple-Project-tp3079744p3079744.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Bug fix to contrib/.../IndexSplitter

2011-06-09 Thread Steven A Rowe
Hi Ivan,

You do have rights to submit fixes to Lucene - everyone does!

Here's how: http://wiki.apache.org/lucene-java/HowToContribute

Please create a patch, create an issue in JIRA, and then attach the patch to 
the JIRA issue.  When you do this, you are asked to state that you grant 
license to your work; this is very important for Apache software projects.  All 
JIRA issue creation and modification events are automatically posted to the 
d...@lucene.apache.org mailing list, so all Lucene developers will see your 
work.

Thanks,
Steve

 -Original Message-
 From: Ivan Vasilev [mailto:ivasi...@sirma.bg]
 Sent: Thursday, June 09, 2011 7:24 AM
 To: LUCENE MAIL LIST
 Subject: Bug fix to contrib/.../IndexSplitter
 
 Hi Guys,
 
 I would like to fix a class in
 contrib/misc/src/java/org/apache/lucene/index called IndexSplitter. It
 has a bug - when it splits the segments into separate indexes, the segment
 descriptor file contains wrong data - the number (the name) of the next
 segment to generate is 0. Although it may not cause an exception in some
 cases (depending on existing segment names and the number of newly
 generated ones), in most cases it does cause an Exception.
 
 I do not know if I have the rights to submit this fix to the Lucene
 contrib dir, but I am attaching the fix and a test that shows the
 exception when using the original class and no exception when using the
 fixed class.
 
 Cheers,
 Ivan


RE: FastVectorHighlighter StringIndexOutofBounds bug

2011-05-22 Thread Steven A Rowe
Hi WeiWei,

Thanks for the report. 

Can you provide a self-contained unit test that triggers the bug?

Thanks,
Steve

 -Original Message-
 From: Weiwei Wang [mailto:ww.wang...@gmail.com]
 Sent: Monday, May 23, 2011 1:25 AM
 To: java-user@lucene.apache.org
 Subject: FastVectorHighlighter StringIndexOutofBounds bug
 
 the following code has a StringIndexOutOfBounds bug when multiple matched
 terms need highlighting
 
 private String makeFragment( WeightedFragInfo fragInfo, String src, int
 s,
   String[] preTags, String[] postTags, Encoder encoder ){
 StringBuilder fragment = new StringBuilder();
 int srcIndex = 0;
 for( SubInfo subInfo : fragInfo.subInfos ){
   for( Toffs to : subInfo.termsOffsets ){
 fragment
   .append( encoder.encodeText( src.substring( srcIndex,
 to.startOffset - s ) ) )
   .append( getPreTag( preTags, subInfo.seqnum ) )
   .append( encoder.encodeText( src.substring( to.startOffset - s,
 to.endOffset - s ) ) )
   .append( getPostTag( postTags, subInfo.seqnum ) );
 srcIndex = to.endOffset - s;
   }
 }
 fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
 return fragment.toString();
  }
--
 王巍巍
 Cell: 18911288489
 MSN: ww.wang...@gmail.com
 Blog: http://whisper.eyesay.org
 围脖:http://t.sina.com/lolorosa


RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud,

That's normal behavior, since you have AND as default operator.  This is 
equivalent to placing a + in front of every element of your query.  In fact, 
if you removed the other two +s, you would get the same behavior.  I think 
you'll get what you want by just switching the default operator to OR?

Steve

 -Original Message-
 From: Renaud Delbru [mailto:renaud.del...@deri.org]
 Sent: Friday, May 20, 2011 5:10 AM
 To: java-user@lucene.apache.org
 Subject: Query Parser, Unary Operators and Multi-Field Query
 
 Hi,
 
 The behaviour of the query parser (either the standard lucene query
 parser, or the query parser contrib) is not what I expect when I am using
 - unary operators
 - a multi-field query
 - AND as default operator.
 
 For example, let say I have two field fieldA and fieldB, and the
 following query:
 +termA +termB termC
 
 Lucene query parsers will expand the query as:
 +(fieldA:termA fieldB:termA) +(fieldA:termB fieldB:termB) +(fieldA:termC
 fieldB:termC)
 
 while I would have expected this
 
 +(fieldA:termA fieldB:termA) +(fieldA:termB fieldB:termB) (fieldA:termC
 fieldB:termC)
 
 Is it the normal behaviour ? A Bug ? Am I doing something wrong ?
 
 Thanks in advance for your help,
 --
 Renaud Delbru
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud,

On 5/20/2011 at 1:58 PM, Renaud Delbru wrote:
 As said in 
 http://lucidworks.lucidimagination.com/display/LWEUG/Boolean+Operators, 
 if one or more of the terms in a term list has an explicit term operator
 (+ or - or relational operator) the rest of the terms will be treated as 
 nice to have.

I am not familiar with Lucid's offerings, so I can't comment on the 
documentation you quoted.  Note, however, that LucidWorks and Lucene are 
different products.

 I would have expected that the default AND operator applies whenever no
 other operators are specified in the query.

 For exmaple
 
  cat +dog -fox
 
 Selects documents which must contain dog and must not contain fox.
 Documents will rank higher if cat is present, but it is not required.
 
 I would have expected such behaviour, whatever Default Operator as
 been defined.

 But it seems that I need to use the Default Operator OR to have this
 behaviour, which breaks our current requirement (we want default
 operator AND if no operators are specified in the query).

Restating: you want default AND behavior when the query contains no operators, 
and default OR behavior when the query *does* contain operators.

This is not supported by the Lucene QueryParser.

 IS there anyway to achieve this ? Or do I need to extend myself the
 queryparser contrib ?

A workaround may be to simply look for "+" and "-" in the query under one of 
the following conditions: preceded either by "(" or " ", or at the beginning of 
the string, e.g. using a regex like /(?:^|[\s(])[+-]/, and if you find a match, 
use default OR operator, and if not, use default AND operator?
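
Something like this (untested; the field name, analyzer and queryString are placeholders):

Pattern explicitOps = Pattern.compile("(?:^|[\\s(])[+-]");
QueryParser parser = new QueryParser(Version.LUCENE_31, "content", analyzer);
parser.setDefaultOperator(explicitOps.matcher(queryString).find()
    ? QueryParser.OR_OPERATOR     // explicit +/- present: rest of terms optional
    : QueryParser.AND_OPERATOR);  // no operators: require all terms
Query query = parser.parse(queryString);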

Steve



RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
Hi Cheng,

Lucene 3.3 does not exist - do you mean branches/branch_3x ?

FYI, as of Lucene 3.1, there is an Ant target you can use to setup an Eclipse 
project for  Lucene/Solr - run this from the top level directory of a full 
source tree (including dev-tools/ directory) checked out from Subversion: 

   ant eclipse

More info here:

   http://wiki.apache.org/solr/HowToContribute#Development_Environment_Tips

Steve

 -Original Message-
 From: cheng [mailto:zhoucheng2...@gmail.com]
 Sent: Sunday, May 15, 2011 4:29 AM
 To: java-user@lucene.apache.org
 Subject: Lucene 3.3 in Eclipse
 
 Hi, I created a java project for Lucene 3.3 in Eclipse, and found that in
 the DbHandleExtractor.java file, the package of
 com.sleepycat.db.internal.Db
 is not resolved. How can I overcome this?
 
 
 
 I have tried to download .jar for this, but don't know which and where to
 download.
 
 
 
 Thanks



RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
(Resending to the list - didn't notice that my reply went to Cheng directly)

There is an Ant target get-db-jar that can do the downloading for you - you 
can see the URL it uses here:

http://svn.apache.org/viewvc/lucene/java/tags/lucene_3_0_3/contrib/db/bdb/build.xml?view=markup#l49

There is another Ant target get-je-jar that does the same thing for the 
contrib/db/bdb-je/ module:

http://svn.apache.org/viewvc/lucene/java/tags/lucene_3_0_3/contrib/db/bdb-je/build.xml?view=markup#l49

Steve

 -Original Message-
 From: cheng [mailto:zhoucheng2...@gmail.com]
 Sent: Sunday, May 15, 2011 10:48 AM
 To: java-user@lucene.apache.org
 Cc: Steven A Rowe
 Subject: RE: Lucene 3.3 in Eclipse
 
 Steve, thanks for correction. You are right. The version is 3.0.3
 released last Oct.
 
 I did place an ant jar in Eclipse, and it does the job to remove some
 compiling errors. However, it seems that I do need some jar file to
 handle the DbHandleExtractor.java and the org.apache.lucene.store.db
 package, which are under contrib/db/bdb/src/java folder.
 
 Do you know where I can find the proper jar file?
 
 Cheng
 
 -Original Message-
 From: Steven A Rowe [mailto:sar...@syr.edu]
 Sent: Sunday, May 15, 2011 10:08 PM
 To: java-user@lucene.apache.org
 Subject: RE: Lucene 3.3 in Eclipse
 
 Hi Cheng,
 
 Lucene 3.3 does not exist - do you mean branches/branch_3x ?
 
 FYI, as of Lucene 3.1, there is an Ant target you can use to setup an
 Eclipse project for  Lucene/Solr - run this from the top level directory
 of a full source tree (including dev-tools/ directory) checked out from
 Subversion:
 
ant eclipse
 
 More info here:
 
 
 http://wiki.apache.org/solr/HowToContribute#Development_Environment_Tips
 
 
 Steve
 
  -Original Message-
  From: cheng [mailto:zhoucheng2...@gmail.com]
  Sent: Sunday, May 15, 2011 4:29 AM
  To: java-user@lucene.apache.org
  Subject: Lucene 3.3 in Eclipse
 
  Hi, I created a java project for Lucene 3.3 in Eclipse, and found that
 in
  the DbHandleExtractor.java file, the package of
  com.sleepycat.db.internal.Db
  is not resolved. How can I overcome this?
 
 
 
  I have tried to download .jar for this, but don't know which and where
 to
  download.
 
 
 
  Thanks
 



RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
A thought: one way to do #1 without modifying ShingleFilter: if there were a 
StopFilter variant that accepted regular expressions instead of a stopword 
list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a full 
match is required, i.e. implicit beginning and end anchors), and place it in 
the analysis pipeline after ShingleFilter to throw out shingles with filler 
tokens in them.

(It think it would be useful to generalize StopFilter to allow for more sources 
of stoppage, rather than just creating a StopRegexFilter with no relation to 
StopFilter.)

Steve

 -Original Message-
 From: Elmo Bleek [mailto:barb...@gmail.com]
 Sent: Thursday, May 12, 2011 12:51 PM
 To: java-user@lucene.apache.org
 Subject: Re: Can I omit ShingleFilter's filler tokens
 
 I have found that simply having StopFilter before ShingleFilter does the
 trick for #2. However, I have also been working on trying to accomplish
 #1,
 don't create shingles across stop words. I am currently under the
 impression
 that this will take modifying ShingleFilter. Does anyone have any
 suggestions?
 
 --
 View this message in context: http://lucene.472066.n3.nabble.com/Can-I-
 omit-ShingleFilter-s-filler-tokens-tp2926009p2932604.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
Cool!  I had forgotten about FilteringTokenFilter.

Elmo, would you care to make a JIRA issue and a patch (based on Robert's code, 
and adding some tests) to create this?  If so, this may be useful:

http://wiki.apache.org/lucene-java/HowToContribute

Steve

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Thursday, May 12, 2011 1:15 PM
 To: java-user@lucene.apache.org
 Subject: Re: Can I omit ShingleFilter's filler tokens
 
 On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe sar...@syr.edu wrote:
  A thought: one way to do #1 without modifying ShingleFilter: if there
 were a StopFilter variant that accepted regular expressions instead of a
 stopword list, you could configure it with a regex like /_ .*|.* _| _ /
 (assuming a full match is required, i.e. implicit beginning and end
 anchors), and place it in the analysis pipeline after ShingleFilter to
 throw out shingles with filler tokens in them.
 
  (It think it would be useful to generalize StopFilter to allow for more
 sources of stoppage, rather than just creating a StopRegexFilter with no
 relation to StopFilter.)
 
 
 we already did this in 3.1 by making a base FilteringTokenFilter class?
 a regex filter is trivial if you subclass this (we could add something
 like this untested code to the .pattern package or whatever)
 
 public class PatternRemoveFilter extends FilteringTokenFilter {
   private final Matcher matcher;
   private final CharTermAttribute termAtt =
 addAttribute(CharTermAttribute.class);
 
   public PatternRemoveFilter(boolean enablePositionIncrements,
 TokenStream input, Pattern pattern) {
 super(enablePositionIncrements, input);
 matcher = pattern.matcher(termAtt);
   }
 
   @Override
   protected boolean accept() throws IOException {
 matcher.reset();
 return !matcher.matches();
   }
 }
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
Hi Bill,

I can think of two possible interpretations of removing filler tokens:

1. Don't create shingles across stopwords, e.g. for text "one two three four 
five" and stopword "three", bigrams only, you'd get ("one two", "four five"), 
instead of the current ("one two", "two _", "_ four", "four five").

2. Create shingles as if the stopwords were never there, e.g. for the same text 
and stopword, bigrams only, you'd get ("one two", "two four", "four five").

Which one did you have in mind?  #2 can be achieved by adding PositionFilter 
after StopFilter and before ShingleFilter.  I think #1 requires ShingleFilter 
modifications.

Steve

 -Original Message-
 From: William Koscho [mailto:wkos...@gmail.com]
 Sent: Wednesday, May 11, 2011 12:05 AM
 To: java-user@lucene.apache.org
 Subject: Can I omit ShingleFilter's filler tokens
 
 Hi,
 
 Can I remove the filler token _ from the n-gram-tokens that are generated
 by
 a ShingleFilter?
 
 I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter,
 and ShingleFilter to create phrase n-grams.  The ShingleFilter inserts
 FILLER_TOKENs in place of the stopwords, but I don't want them.
 
 How can I omit the filler tokens?
 
 thanks
 Bill


RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
Yes, StopFilter.setEnablePositionIncrements(false) will almost certainly get 
higher throughput than inserting PositionFilter.  Like PositionFilter, this 
will buy you #2 (create shingles as if stopwords were never there), but not #1 
(don't create shingles across stopwords).
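
For example (untested; stopSet is whatever stopword set you use):

TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
StopFilter stop = new StopFilter(Version.LUCENE_31, stream, stopSet);
stop.setEnablePositionIncrements(false);   // no position holes left by removed stopwords
TokenStream shingles = new ShingleFilter(stop, 2, 2);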

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Wednesday, May 11, 2011 9:02 AM
 To: java-user@lucene.apache.org
 Subject: Re: Can I omit ShingleFilter's filler tokens
 
 another idea is to .setEnablePositionIncrements(false) on your
 stopfilter.
 
 On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe sar...@syr.edu wrote:
  Hi Bill,
 
  I can think of two possible interpretations of removing filler
 tokens:
 
  1. Don't create shingles across stopwords, e.g. for text one two three
 four five and stopword three, bigrams only, you'd get (one two,
 four five), instead of the current (one two, two _, _ four, four
 five).
 
  2. Create shingles as if the stopwords were never there, e.g. for the
 same text and stopword, bigrams only, you'd get (one two, two four,
 four five).
 
  Which one did you have in mind?  #2 can be achieved by adding
 PositionFilter after StopFilter and before ShingleFilter.  I think #1
 requires ShingleFilter modifications.
 
  Steve
 
  -Original Message-
  From: William Koscho [mailto:wkos...@gmail.com]
  Sent: Wednesday, May 11, 2011 12:05 AM
  To: java-user@lucene.apache.org
  Subject: Can I omit ShingleFilter's filler tokens
 
  Hi,
 
  Can I remove the filler token _ from the n-gram-tokens that are
 generated
  by
  a ShingleFilter?
 
  I'm using a chain of filters: ClassicFilter, StopFilter,
 LowerCaseFilter,
  and ShingleFilter to create phrase n-grams.  The ShingleFilter inserts
  FILLER_TOKENs in place of the stopwords, but I don't want them.
 
  How can I omit the filler tokens?
 
  thanks
  Bill
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul,

What did you find about Luke that's buggy?  Bug reports are very useful; please 
contribute in this way.

The official Lucene 3.0.3 distribution jars were compiled using the -g cmdline 
argument to javac - by default, though, only line number and source file 
information is generated.  If you want local variable information too, you 
could download the source and make your own debug-enabled jar(s), right?:

0. Install Ant 1.7.1: http://archive.apache.org/dist/ant/binaries/

1. svn checkout http://svn.apache.org/repos/asf/lucene/java/tags/lucene_3_0_3

2. Add 'debuglevel=lines,source,vars' to the compile macrodef in 
common-build.xml 
http://svn.apache.org/viewvc/lucene/java/tags/lucene_3_0_3/common-build.xml?revision=1040994view=markup#l536
 in the javac task invocation, e.g.:

545:    <javac
546:      encoding="${build.encoding}"
547:      srcdir="@{srcdir}"
548:      destdir="@{destdir}"
549:      deprecation="${javac.deprecation}"
550:      debug="${javac.debug}"
Add --> debuglevel="lines,source,vars"
...

3. run "ant clean jar" from the command line.  The Lucene core jar will be in 
the build/ directory.  (If you need one of the contrib jars, run "ant package" 
instead.)

Steve

 -Original Message-
 From: Paul Taylor [mailto:paul_t...@fastmail.fm]
 Sent: Friday, April 29, 2011 7:09 AM
 To: java-user@lucene.apache.org
 Subject: Lucene 3.0.3 with debug information
 
 Is there a built debug version of lucene 3.0.3 so I can profile it
 properly to find what part of the search is taking the time.
 
 Note:Ive already profiled by application and determined that it is the
 lucene/Search that is taking the time, I also had another attempt using
 luke but find it incredibly buggy and of little use.
 
 thanks Paul
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul,

On 4/29/2011 at 4:14 PM, Paul Taylor wrote:
 On 29/04/2011 16:03, Steven A Rowe wrote:
  What did you find about Luke that's buggy?  Bug reports are very
  useful; please contribute in this way.

 Please see previous post, in summary mistake on my part.

Okay... Which previous post?  I searched for posts by you to Lucene mailing 
lists, and found no mention of Luke other than the one complaining about bugs?

Steve



RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Thanks Dawid. – Steve

From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid 
Weiss
Sent: Friday, April 29, 2011 4:45 PM
To: java-user@lucene.apache.org
Cc: Steven A Rowe
Subject: Lucene 3.0.3 with debug information


This is the e-mail you're looking for, Steven (it wasn't forwarded to the list, 
apparently).

Dawid
-- Forwarded message --
From: Paul Taylor paul_t...@fastmail.fm
Date: Fri, Apr 29, 2011 at 10:11 PM
Subject: Re: Lucene 3.0.3 with debug information
To: Dawid Weiss dawid.we...@gmail.com

On 29/04/2011 15:17, Dawid Weiss wrote:

 lucene/Search that is taking the time, I also had another attempt using luke
 but find it incredibly buggy and of little use

Can you expand on this too? What kind of incredible bugs did you see? Without 
feedback there is little progress, so bug reports count.

Dawid
Sorry, I'll withdraw that. I was getting all kinds of stacktraces and 
exceptions when I tried to do searches but the problem was my fault. Because I 
wanted to use my own analyzer  I had a shells script that added it to the 
classpath when I ran luke, however I had put it before the ant jar and my jar 
built with maven also included lucene 3.0.3 and because luke 1.0.1 is packaged 
with 3.0.0 it was confusing it, but I didnt realize this until I notice done 
exception complained a lucene method was missing.

But having got it working I cannot see anything to help me work out why the 
queries are taking too long, is it useful for this or just for refining your 
queries ?

Paul



RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-27 Thread Steven A Rowe
Ranjit,

The problem is definitely the analyzer you are passing to QueryParser or 
MultiFieldQueryParser, and not the parser itself.

The following tests succeed using KeywordAnalyzer, which is a pass-through 
analyzer (the output is the same as the input):

  public void testSharpQP() throws Exception {
    Analyzer analyzer = new KeywordAnalyzer();
    QueryParser qp = new QueryParser
        (Version.LUCENE_30, "default_field", analyzer);
    assertEquals("+c# +.net", qp.parse
        ("c# AND .net").toString("default_field"));
  }

  public void testSharpMFQP() throws Exception {
    Analyzer analyzer = new KeywordAnalyzer();
    String[] fields = { "one", "two" };
    MultiFieldQueryParser mfqp = new MultiFieldQueryParser
        (Version.LUCENE_30, fields, analyzer);
    assertEquals("+(one:c# two:c#) +(one:.net two:.net)",
                 mfqp.parse("c# AND .net").toString());
  }

Steve

 -Original Message-
 From: Ranjit Kumar [mailto:ranjit.ku...@otssolutions.com]
 Sent: Wednesday, April 27, 2011 3:24 AM
 To: java-user-h...@lucene.apache.org; java-user@lucene.apache.org
 Subject: Re: lucene 3.0.3 | QueryParser | MultiFieldQueryParser
 
 Hi,
 while creating the index with the help of the lucene StandardAnalyzer, we cannot
 tell the difference between c, c++ and c#, as lucene does not create index terms
 for c++ and c#. To make the difference between these terms I need to change
 the grammar of lucene with the help of jFlex, which forces me to create my
 own custom analyzer.
 
 While I am searching for a single term like c# I get the correct result (also
 in the case of c++). So, lucene makes index terms for C++ and c#, and I do not
 need to use any parser. Hence, jFlex is doing its work properly.
 But, when I am trying to search a multi-term Boolean query like c# AND
 .net, I need to use MultiFieldQueryParser to get the correct
 result (document). Then the parser strips off # but not the dot (.), so the query
 becomes c AND .net.
 Also, I have made changes for c#.net, vb.net, .net; all these work
 properly with MultiFieldQueryParser except c#.
 
 Thanks  Regards,
 Ranjit Kumar
 =
 == Private, Confidential and Privileged. This e-
 mail and any files and attachments transmitted with it are confidential
 and/or privileged. They are intended solely for the use of the intended
 recipient. The content of this e-mail and any file or attachment
 transmitted with it may have been changed or altered without the consent
 of the author. If you are not the intended recipient, please note that
 any review, dissemination, disclosure, alteration, printing, circulation
 or Transmission of this e-mail and/or any file or attachment transmitted
 with it, is prohibited and may be unlawful. If you have received this e-
 mail or any file or attachment transmitted with it in error please notify
 OTS Solutions at i...@otssolutions.com
 =
 ==


RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-26 Thread Steven A Rowe
Hi Ranjit,

I suspect the problem is not QueryParser, since the TERM definition includes 
the '#' character (from 
http://svn.apache.org/viewvc/lucene/java/tags/lucene_3_0_3/src/java/org/apache/lucene/queryParser/QueryParser.jj?view=markup#l1136):

| <#_TERM_START_CHAR: ( ~[ " ", "\t", "\n", "\r", "\u3000", "+", "-",
                           "!", "(", ")", ":", "^", "[", "]", "\"",
                           "{", "}", "~", "*", "?", "\\" ]
                       | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) >
...
| <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >

Are you sure that your custom JFlex Analyzer is not being given 'C#' and then 
stripping off the '#'?

You could work around this issue by pre-processing your query (and your 
documents) to replace C# with csharp or something like it that would not be 
broken up.

Steve

 -Original Message-
 From: Ranjit Kumar [mailto:ranjit.ku...@otssolutions.com]
 Sent: Tuesday, April 26, 2011 9:55 AM
 To: java-user-h...@lucene.apache.org; java-user@lucene.apache.org
 Subject: lucene 3.0.3 | QueryParser | MultiFieldQueryParser
 
 Hi,
 
 I have created my own custom analyzer and use jFlex to make search for
 c#, .net, c++ etc. work.
 
 While I am trying to search c#, .net, c++, QueryParser parses .net to .net
 and C++ to C++. So it works fine. But in the case of C#, QueryParser parses it
 to C, which makes trouble for me.
 
 Also tried to use MultiFieldQueryParser but it also does the same.
 
 Any help or suggestion will be appreciated!!!
 
 
 Thanks  Regards,
 Ranjit Kumar
 
 =
 == Private, Confidential and Privileged. This e-
 mail and any files and attachments transmitted with it are confidential
 and/or privileged. They are intended solely for the use of the intended
 recipient. The content of this e-mail and any file or attachment
 transmitted with it may have been changed or altered without the consent
 of the author. If you are not the intended recipient, please note that
 any review, dissemination, disclosure, alteration, printing, circulation
 or Transmission of this e-mail and/or any file or attachment transmitted
 with it, is prohibited and may be unlawful. If you have received this e-
 mail or any file or attachment transmitted with it in error please notify
 OTS Solutions at i...@otssolutions.com
 =
 ==


RE: lucene 3.0.3 | searching problem with *.docx file

2011-04-12 Thread Steven A Rowe
Hi Ranjit,

Do you know about Luke?  It will let you see what's in your index, and much 
more:

http://code.google.com/p/luke/

Steve

 -Original Message-
 From: Ranjit Kumar [mailto:ranjit.ku...@otssolutions.com]
 Sent: Tuesday, April 12, 2011 9:05 AM
 To: java-user-h...@lucene.apache.org; java-user@lucene.apache.org
 Subject: lucene 3.0.3 | searching problem with *.docx file
 
 Hi,
 
 I am creating an index with the help of StandardAnalyzer for *.docx files and
 it's fine. But at the time of searching it does not give results for these
 *.docx files.
 
 any help or suggestion will be appreciated!!!
 
 
 Thanks  Regards,
 Ranjit Kumar
 ==
 = Private, Confidential and Privileged. This e-
 mail and any files and attachments transmitted with it are confidential
 and/or privileged. They are intended solely for the use of the intended
 recipient. The content of this e-mail and any file or attachment
 transmitted with it may have been changed or altered without the consent
 of the author. If you are not the intended recipient, please note that any
 review, dissemination, disclosure, alteration, printing, circulation or
 Transmission of this e-mail and/or any file or attachment transmitted with
 it, is prohibited and may be unlawful. If you have received this e-mail or
 any file or attachment transmitted with it in error please notify OTS
 Solutions at i...@otssolutions.com
 ==
 =


RE: Lucene 3.1

2011-04-05 Thread Steven A Rowe
Hi Tanuj,

Can you be more specific?  

What file did you download? (Lucene 3.1 has three downloadable packages: 
-src.tar.gz, .tar.gz, and .zip.) 

What did you expect to find that is not there?  (Some examples would help.)

Steve

 -Original Message-
 From: Tanuj Jain [mailto:tanujjain.m...@gmail.com]
 Sent: Tuesday, April 05, 2011 7:27 AM
 To: java-user@lucene.apache.org
 Subject: Lucene 3.1
 
 Hi,
 I have downloaded lucene 3.1 and want to use in my program.
 I found lot of files that differ/missing from lucene 3.0. Is there any way
 I
 could get those files as a whole rather than searching for each file and
 downloading it.


RE: word + ngram tokenization

2011-04-05 Thread Steven A Rowe
Hi Shambhu,

ShingleFilter will construct word n-grams:

http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html
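
For example, here's an (untested) sketch of an analyzer that emits only 3-word shingles; note that the min/max shingle size constructor used here requires Lucene 3.1:

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;
  import org.apache.lucene.analysis.shingle.ShingleFilter;
  import org.apache.lucene.util.Version;

  Analyzer trigramAnalyzer = new Analyzer() {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_31, reader);
      ShingleFilter shingles = new ShingleFilter(stream, 3, 3); // min = max = 3 words
      shingles.setOutputUnigrams(false); // emit only the shingles, not single words
      return shingles;
    }
  };

Given "the quick brown fox jumped over the lazy dog", this produces "the quick brown", "quick brown fox", "brown fox jumped", and so on.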

Steve

 -Original Message-
 From: sham singh [mailto:shamsing...@gmail.com]
 Sent: Tuesday, April 05, 2011 5:53 PM
 To: java-user@lucene.apache.org
 Subject: word + ngram tokenization
 
 Hi All,
 
 I have to do tokenization which is combination of NGram and Standard
 tokenization
 for ex if the content is: "the quick brown fox jumped over the lazy dog", the
 requirement is to tokenize into:
 quick brown fox
 brown fox jumped
 fox jumped over etc
 ..
 ..
 
 Please help me to find out best analyzer for my requirement
 
 Thanks in Advance
 
 --
 Many Thanks,
 Shambhu
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: lucene-snowball 3.1.0 packages are missing?

2011-04-03 Thread Steven A Rowe
Hi Alex,

From Lucene contrib CHANGES.html
http://lucene.apache.org/java/3_1_0/changes/Contrib-Changes.html#3.1.0.changes_in_backwards_compatibility_policy:

3. LUCENE-2226: Moved contrib/snowball functionality into 
contrib/analyzers. Be sure to remove any old obsolete 
lucene-snowball jar files from your classpath!
(Robert Muir)

You want the lucene-analyzers jar:

http://repo2.maven.org/maven2/org/apache/lucene/lucene-analyzers/3.1.0/

Steve

 -Original Message-
 From: Alex Ott [mailto:alex...@gmail.com]
 Sent: Sunday, April 03, 2011 7:54 AM
 To: java-user@lucene.apache.org
 Subject: lucene-snowball 3.1.0 packages are missing?
 
 Hello
 
 I'm trying to upgrade Lucene in my project to 3.1.0 release, but there is
 no lucene-snowball 3.1.0 package on maven central. Is it intended
 behaviour? Should I continue to use 3.0.3 for snowball package?
 
 
 --
 With best wishes, Alex Ott
 http://alexott.blogspot.com/http://alexott.net/
 http://alexott-ru.blogspot.com/
 Skype: alex.ott
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Steven A Rowe
 [x] ASF Mirrors (linked in our release announcements or via the Lucene
 website)
 
 [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
 
 [x] I/we build them from source via an SVN/Git checkout.


RE: lucene-based log searcher?

2011-01-13 Thread Steven A Rowe
Hi Paul,

I saw this yesterday, but haven't tried it myself:

http://karussell.wordpress.com/2010/10/27/feeding-solr-with-its-own-logs/

The author has a project called Sogger - Solr + Logger? - that can read 
various forms of logs.

Steve

 -Original Message-
 From: Paul Libbrecht [mailto:p...@hoplahup.net]
 Sent: Thursday, January 13, 2011 7:54 AM
 To: java-user@lucene.apache.org
 Subject: lucene-based log searcher?
 
 
 Hello list,
 
 has anyone built a log-analyzer based on Lucene?
 Our logs are so big that grep takes more hours to do what I want it to do.
 I'm sure Lucene would solve it.
 
 Thanks in advance
 paul
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Re: Scale up design

2010-12-22 Thread Steven A Rowe
On 12/22/2010 at 2:38 AM, Ganesh wrote:
 Any other tips targeting 64 bit?

If memory usage is an issue, you might consider using HotSpot's compressed 
oops option:

http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

Benson Margulies has written that the memory savings from using compressed 
oops isn't necessarily free - it can impact performance:

http://lists.apple.com/archives/java-dev/2010/Apr/msg00157.html
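
If you want to try it, it's a single HotSpot flag (the heap size and jar name below are just placeholders):

  java -XX:+UseCompressedOops -Xmx24g -jar your-search-app.jar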

Steve



RE: Analyzer

2010-11-29 Thread Steven A Rowe
Hi Manjula,

It's not terribly clear what you're doing here - I got lost in your description 
of your (two? or maybe four?) classes.  Sometimes things are easier to 
understand if you provide more concrete detail.

I suspect that you could benefit from reading the book Lucene in Action, 2nd 
edition: 

   http://www.manning.com/hatcher3/

You would also likely benefit from using Luke, the Lucene index browser, to 
better understand your indexes' contents and debug how queries match documents: 

   http://code.google.com/p/luke/

I think your question is whether you're using Analyzers correctly.  It sounds 
like you are creating two separate indexes (one for each of your classes), and 
you're using SnowballAnalyzer on the indexing side for both indexes, and 
StandardAnalyzer on the query side.  

The usual advice is to use the same Analyzer on both the query and the index 
side.  But it appears to be the case that you are taking stemmed index terms 
from your index #1 and then querying index #2 using these stemmed terms.  If 
this is true, then you want the query-time analyzer in your second index not to 
change the query terms.  You'll likely get better results using 
WhitespaceAnalyzer, which tokenizes on whitespace and does no further analysis, 
rather than StandardAnalyzer.
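
A minimal (untested) sketch of that pairing - the field name "contents" and the variable holding the stemmed term are just examples:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  // Index both indexes with the stemming analyzer...
  Analyzer indexAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

  // ...but parse the already-stemmed terms taken from index #1 with an
  // analyzer that won't change them.
  Analyzer queryAnalyzer = new WhitespaceAnalyzer();
  QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", queryAnalyzer);
  Query query = parser.parse(highFrequencyTermFromIndex1); // hypothetical variable, e.g. "comput"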

Steve

 -Original Message-
 From: manjula wijewickrema [mailto:manjul...@gmail.com]
 Sent: Monday, November 29, 2010 4:32 AM
 To: java-user@lucene.apache.org
 Subject: Analyzer
 
 Hi,
 
 In my work, I am using Lucene and two java classes. In the first one, I
 index a document and in the second one, I try to search the most relevant
 document for the indexed document in the first one. In the first java
 class,
 I use the SnowballAnalyzer in the createIndex method and StandardAnalyzer
 in
 the searchIndex method and pass the highest frequency terms into the
 second
 Java class. In the second class, I use SnowballAnalyzer in the createIndex
 method (this index is for the collection of documents to be searched, or
 it
 is my database) and StandardAnalyser in the searchIndex method (I pass the
 highest frequently occuring term of the first class as the search term
 parameter to the searchIndex method of the second class). Using Analyzers
 in
 this manner, what I am willing is to do the stemming, stop-words in both
 indexes (in both classes) and to search those a few high frequency words
 (of
 the first index) in the second index. So, if my intention is clear to you,
 could you please let me know whether it is correct or not the way I have
 used Analyzers? I highly appreciate any comment.
 
 Thanx.
 Manjula.


RE: IndexWriters and write locks

2010-11-10 Thread Steven A Rowe
NFS[1] != NTFS[2]

[1] NFS: http://en.wikipedia.org/wiki/Network_File_System_%28protocol%29
[2] NTFS: http://en.wikipedia.org/wiki/NTFS

 -Original Message-
 From: Pulkit Singhal [mailto:pulkitsing...@gmail.com]
 Sent: Wednesday, November 10, 2010 2:55 PM
 To: java-user@lucene.apache.org
 Subject: Re: IndexWriters and write locks
 
 You know that really confuses me. I've heard that stated a few times and
 every time I just felt that it couldn't possibly be right. Maybe it was
 meant in some very specific manner because otherwise aren't all Windows
 OSs
 off-limits to Lucene then?
 
 On Wed, Nov 10, 2010 at 2:40 PM, Uwe Schindler u...@thetaphi.de wrote:
 
  Are you using NFS as filesystem? NFS is incompatible to lucene :-)
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
   -Original Message-
   From: Pulkit Singhal [mailto:pulkitsing...@gmail.com]
   Sent: Wednesday, November 10, 2010 7:57 PM
   To: java-user@lucene.apache.org
   Subject: Re: IndexWriters and write locks
  
   Thanks Uwe, that helps explain why the lock file is still there.
  
   The last piece of the puzzle is why someone may see exceptions such as
  the
   following from time to time:
  
   java.nio.channels.OverlappingFileLockException
   at
  
 
 sun.nio.ch.FileChannelImpl$SharedFileLockTable.checkList(FileChannelImpl.j
 ava
   :1176)
   at
  
 
 sun.nio.ch.FileChannelImpl$SharedFileLockTable.add(FileChannelImpl.java:10
 7
   8)
   at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:878)
   at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
   at
  
 org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:236)
   at org.apache.lucene.store.Lock.obtain(Lock.java:72)
   at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1041)
   at
 org.apache.lucene.index.IndexWriter.init(IndexWriter.java:864)
  
   I suppose this means that the OS itself hasn't released the lock even
  after I shut
   down my application server and restarted it.
   Am I right?
  
   Or is there something else that can possibly be the culprit (in
 anyone's
   experience) that I can investigate?
  
   - Pulkit
  
   On Wed, Nov 10, 2010 at 12:57 PM, Uwe Schindler u...@thetaphi.de
 wrote:
  
This is because Lucene uses Native Filesystem Locks. The lock file
itself is just a placeholder which is not cleaned up on Ctrl-C. The
lock is not the file itself, its *on* the file.
   
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
   
 -Original Message-
 From: Pulkit Singhal [mailto:pulkitsing...@gmail.com]
 Sent: Wednesday, November 10, 2010 3:38 PM
 To: java-user@lucene.apache.org
 Subject: IndexWriters and write locks

 Hello,

 1) On Windows, I often shut down my application server (which has
 active IndexWriters open) using the ctrl+c keys.
 2) I inspect my directories on the file system I see that the
 write.lock
file is still
 there.
 3) I start the app server again, and do some operations that would
require
 IndexWriters to write to the same directories again and it works!

 I don't understand why I do not run into any exceptions?
 I mean there is already a lock file present which should prevent
 the
 IndexWriters from getting access to the directories ... no?
 I should be happy but I'm not because other folks are able to get
exceptions
 when they bounce their servers an I'm unable to reproduce the
 problem and
I
 can't help them.

 Any clues? Anyone?

 Thank You,
 - Pulkit
   
   

 -
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
   
   
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 


RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
Hi Martin,

StandardTokenizer and -Analyzer have been changed, as of future version 3.1 
(the next release) to support the Unicode segmentation rules in UAX#29.  My 
(untested) guess is that your hyphenated word will be kept as a single token if 
you set the version to 3.1 or higher in the constructor.

Steve

 -Original Message-
 From: Martin O'Shea [mailto:app...@dsl.pipex.com]
 Sent: Sunday, October 24, 2010 3:59 PM
 To: java-user@lucene.apache.org
 Subject: Use of hyphens in StandardAnalyzer
 
 Hello
 
 
 
 I have a StandardAnalyzer working which retrieves words and frequencies
 from
 a single document using a TermVectorMapper which is populating a HashMap.
 
 
 
 But if I use the following text as a field in my document, i.e.
 
 
 
 addDoc(w, "lucene Lawton-Browne Lucene");
 
 
 
 The word frequencies returned in the HashMap are:
 
 
 
 browne 1
 
 lucene 2
 
 lawton 1
 
 
 
 The problem is the words 'lawton' and 'browne'. If this is an actual
 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
 the
 name is actually a single word?
 
 
 
 I've tried combinations of:
 
 
 
 addDoc(w, "lucene \"Lawton-Browne\" Lucene");
 
 
 
 And single quotes but without success.
 
 
 
 Thanks
 
 
 
 Martin O'Shea.
 
 
 
 
 
 



RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
Sorry, releases are not scheduled.

There is a general feeling that a 3.1 release could happen fairly soon, though.

Currently, there is a push to improve test coverage and fix bugs that shake out 
as a result.

As another measure of how close the release is, you can check here to see how 
many issues remain targeting the 3.1 release - once these go to zero, a release 
is likely imminent:

Lucene open/reopened fix for 3.1: 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?sorter/field=priority&resolution=-1&pid=12310110&fixfor=12314822
 
Solr open/reopened fix for 3.1: 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?sorter/field=priority&resolution=-1&pid=12310230&fixfor=12314371

My estimate of when a release will occur: sometime in the next two or three 
months.

The 3.X branch (where the 3.1 release will be cut from) is quite stable - you 
should consider using it even pre-release.

Steve

 -Original Message-
 From: Martin O'Shea [mailto:app...@dsl.pipex.com]
 Sent: Sunday, October 24, 2010 5:29 PM
 To: java-user@lucene.apache.org
 Subject: FW: Use of hyphens in StandardAnalyzer
 
 A good suggestion. But I'm using Lucene 3.0.2 and the constructor for a
 StandardAnalyzer has Version_30 as its highest value. Do you know when 3.1
 is due?
 
 -Original Message-
 From: Steven A Rowe [mailto:sar...@syr.edu]
 Sent: 24 Oct 2010 21 31
 To: java-user@lucene.apache.org
 Subject: RE: Use of hyphens in StandardAnalyzer
 
 Hi Martin,
 
 StandardTokenizer and -Analyzer have been changed, as of future version
 3.1 (the next release) to support the Unicode segmentation rules in
 UAX#29.  My (untested) guess is that your hyphenated word will be kept as
 a single token if you set the version to 3.1 or higher in the constructor.
 
 Steve
 
  -Original Message-
  From: Martin O'Shea [mailto:app...@dsl.pipex.com]
  Sent: Sunday, October 24, 2010 3:59 PM
  To: java-user@lucene.apache.org
  Subject: Use of hyphens in StandardAnalyzer
 
  Hello
 
  I have a StandardAnalyzer working which retrieves words and frequencies
  from a single document using a TermVectorMapper which is populating a
  HashMap.
 
  But if I use the following text as a field in my document, i.e.
 
  addDoc(w, "lucene Lawton-Browne Lucene");
 
  The word frequencies returned in the HashMap are:
 
  browne 1
  lucene 2
  lawton 1
 
  The problem is the words 'lawton' and 'browne'. If this is an actual
  'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
  the name is actually a single word?
 
  I've tried combinations of:
 
  addDoc(w, "lucene \"Lawton-Browne\" Lucene");
 
  And single quotes but without success.
 
  Thanks
 
  Martin O'Shea.



RE: Issue with sentence specific search

2010-10-07 Thread Steven A Rowe
Hi Sirish,

StandardTokenizer does not produce a token from '#', as you suspected.  
Something that fits the word definition, but which won't ever be encountered 
in your documents, is what you should use for the delimiter - something like 
a1b2c3c2b1a .
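
On the indexing side that just means gluing the marker word in between sentences, e.g. (untested sketch; the field name and marker value are only examples):

  private static final String SENTENCE_BOUNDARY = "a1b2c3c2b1a";

  StringBuilder text = new StringBuilder();
  for (String sentence : sentences) {
    // StandardTokenizer keeps "a1b2c3c2b1a" as an ordinary indexed term
    text.append(sentence).append(' ').append(SENTENCE_BOUNDARY).append(' ');
  }
  doc.add(new Field("text", text.toString(), Field.Store.NO, Field.Index.ANALYZED));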

Sentence boundary handling is clunky in Lucene right now - there has been some 
discussion of how to directly support this kind of thing, but no code at this 
point.

Steve

 -Original Message-
 From: Sirish Vadala [mailto:sirishre...@gmail.com]
 Sent: Thursday, October 07, 2010 7:13 PM
 To: java-user@lucene.apache.org
 Subject: RE: Issue with sentence specific search
 
 
 Hi Steven,
 
 I have implemented sentence specific proximity search as suggested below.
 However, unfortunately it still doesn't identify the sentence boundaries
 for
 my search.
 
 I am using # as a delimiter between my sentences while indexing the
 content:
 
 
 ArrayList<String> sentencesList = sentenceScanner.getAllSentences();
 StringBuffer textWithToken = new StringBuffer();
 for (String sentence : sentencesList){
   textWithToken.append(sentence + " # ");
 }
 addFieldToDocument(document, IFIELD_TEXT, textWithToken.toString(), true,
 true);
 
 * Used StandardAnalyzer to initialize the indexWriter while adding the
 document
 
 This is how I am performing my search:
 
 
 Query query = null;
 strQuery = strQuery.replaceAll("\\s+", " ");
 String[] spanTerms = strQuery.split(" ");
 SpanQuery[] spanQueries = new SpanQuery[spanTerms.length];
 for (int count = 0; count  spanTerms.length; count++) {
   String spanTerm = spanTerms[count];
   spanQueries[count] = new SpanTermQuery(new Term(field, spanTerm));
 }
 if(!withinSentence){
   SpanQuery spanQuery = new SpanNearQuery(spanQueries, span, true);
   query = spanQuery;
 } else if (withinSentence){
   SpanQuery queryInclude = new SpanNearQuery(spanQueries, span, true);
   SpanQuery queryExclude = new SpanTermQuery(new Term(field, "#"));
   SpanQuery spanNotQuery = new SpanNotQuery(queryInclude,
 queryExclude);
   query = spanNotQuery;
 }
 bQuery.add(query, BooleanClause.Occur.MUST);
 
 
 
 When I eventually read my query on the console, this is how it looks in
 both
 cases:
 
 With no sentence boundary
 +(author:amanda) +spanNear([text:efficiency, text:delta], 10, true)
 +(year:2009 year:2010)
 
 With sentence boundary
 +(author:amanda) +spanNot(spanNear([text:efficiency, text:delta], 10,
 true),
 text:#) +(year:2009 year:2010)
 
 My guess is that probably, my index isn't saving the sentence boundary
 value
 # as a separate term. Any hints or pointers on where exactly I am
 mis-implementing would be highly appreciated.
 
 Thanks.
 --
 View this message in context: http://lucene.472066.n3.nabble.com/Issue-
 with-sentence-specific-search-tp1644352p1651512.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish,

I think I understand within sentence phrase search - you want the entire 
phrase to be within a single sentence.  But can you give an example of non 
sentence specific phrase search?  It's not clear to me how useful such 
capability would be.

Steve

 -Original Message-
 From: Sirish Vadala [mailto:sirishre...@gmail.com]
 Sent: Wednesday, October 06, 2010 2:33 PM
 To: java-user@lucene.apache.org
 Subject: Issue with sentence specific search
 
 
 Hello All:
 
 Can any one suggest me the best way to implement both sentence specific
 and
 non sentence specific phrase search? The user is going to have a check box
 for phrase search on the screen that says 'within sentence'. If s/he
 selects
 'within sentence', then I should perform sentence specific search, if not
 this should be a regular non sentence specific search.
 
 Right now I am adding each sentence as a separate field(with the same
 field
 name) to the same document. Also I am setting the  position increment gap
 that I did by sub-classing Analyzer and overriding
 Analyzer#getPositionIncrementGap() to return 10.
 
 Right now all I could think about is to maintain two different versions of
 indexes, one for sentence specific and the other for non sentence
 specific.
 But, this sounds crude as it doubles the size of my entire indexes (around
 7
 gigs). I am pretty sure that there should be a better way to achieve this.
 
 Any hint would be highly appreciated.
 Thanks.
 --
 View this message in context: http://lucene.472066.n3.nabble.com/Issue-
 with-sentence-specific-search-tp1644352p1644352.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish,

Have you looked at SpanQuery's yet?:

http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/spans/package-summary.html

See also this Lucid Imagination blog post by Mark Miller:

http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/

One common technique, instead of using a larger-than-normal position increment 
gap between sentences, is using a sentence boundary token like '$' or something 
else that won't ever itself be the target of search.  Quoting from a post Mark 
Miller made to the lucene-user list last year 
(http://www.lucidimagination.com/search/document/c9641cbb1a3bf928/multiline_regex_with_lucene):

First you inject special marker tokens as your paragraph/
sentence markers, then you use a SpanNotQuery that looks
for a SpanNearQuery that doesn't intersect with a
SpanTermQuery containing the special marker term.

Mark's suggestion would work for your within-sentence case, and for the case 
where you don't care about sentence boundaries, you can use SpanNearQuery 
without the SpanNotQuery.

Using this technique, a single field should serve all of your needs.
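
In code, both cases against the same field might look like this (untested; the field name and marker term are examples):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanNotQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  SpanQuery[] words = new SpanQuery[] {
      new SpanTermQuery(new Term("text", "efficiency")),
      new SpanTermQuery(new Term("text", "delta"))
  };

  // "within xx words", ignoring sentence boundaries:
  SpanQuery near = new SpanNearQuery(words, 10, false);

  // "within xx words", but only inside a single sentence:
  SpanQuery marker = new SpanTermQuery(new Term("text", "a1b2c3c2b1a"));
  SpanQuery withinSentence = new SpanNotQuery(near, marker);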

Steve

 -Original Message-
 From: Sirish Vadala [mailto:sirishre...@gmail.com]
 Sent: Wednesday, October 06, 2010 3:19 PM
 To: java-user@lucene.apache.org
 Subject: RE: Issue with sentence specific search
 
 
 Hmmm... My mistake.
 
 In fact it is not a phrase search, but its a proximity search.
 
 My screen gives four options to the user: -All words, -Exact phrase, -At
 least one word, -Within proximity of xx words.
 
 In case of -All words and -At least one word, this is irrelevant an
 everything works fine.
 
 In case of -Exact phrase, I do need to make it sentence specific that
 works
 well with my current implementation.
 
 In case of -Within proximity of xx words, the user wants to have an
 option,
 to either check within xx words in the same sentence or without any
 sentence
 boundaries.
 
 I am using the following code to perform proximity search:
 
 -
 QueryParser qParser = new QueryParser(Version.LUCENE_29, field,
 this.analyzer);
 qParser.setDefaultOperator(QueryParser.OR_OPERATOR);
 query = qParser.parse(strQuery);
 
  //strQuery format -- "Search Text"~SPAN
 
 -
 bQuery.add(query, BooleanClause.Occur.MUST);
 -
 
 this.analyzer is my custom analyzer. This is to implement, as I already
 said, right now I am adding each sentence as a separate field(with the
 same
 field name) to the same document. Also I am setting the  position
 increment
 gap that I did by sub-classing Analyzer and overriding
 Analyzer#getPositionIncrementGap() to return 10.
 
 Since for each sentence, the position increment gap is modified, I am not
 sure if I can perform a sentence independent proximity search.
 
 Apologize for not putting it well before and appreciate any responses.
 --
 View this message in context: http://lucene.472066.n3.nabble.com/Issue-
 with-sentence-specific-search-tp1644352p1644598.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
This is not a defect: 
http://wiki.apache.org/lucene-java/LuceneFAQ#Does_Lucene_allow_searching_and_indexing_simultaneously.3F.

 -Original Message-
 From: Justin [mailto:cry...@yahoo.com]
 Sent: Monday, October 04, 2010 2:03 PM
 To: java-user@lucene.apache.org
 Subject: Updating documents with fields that aren't stored
 
 Hi all,
 
 The JavaDocs do not appear to mention that only stored fields persist
 IndexWriter.updateDocument. When opening new readers, from either
 IndexWriter.getReader or IndexReader.open, neither TermDocs nor
 IndexSearcher
 will find terms in fields which weren't stored.
 
 Existing readers, however, do continue to find such terms after
 updateDocument
 has been called. At best, this is confusing. Is this a defect?
 
 Thanks,
 Justin
 
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
Yes, even for IW.getReader() - from 
http://wiki.apache.org/lucene-java/NearRealtimeSearch:

Now Lucene offers a unified API where one calls getReader
and any updates are immediately searchable.

I.e., the reader returned by getReader doesn't track updates; it too represents 
a snapshot in time.  It's just less costly to reopen.
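
A rough (untested) sketch of the usual pattern:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;

  IndexReader reader = writer.getReader();   // sees everything indexed so far
  // ... more addDocument()/updateDocument()/deleteDocuments() calls ...
  IndexReader newReader = reader.reopen();   // cheap refresh; returns a new snapshot
  if (newReader != reader) {
    reader.close();
    reader = newReader;
  }
  IndexSearcher searcher = new IndexSearcher(reader);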


 -Original Message-
 From: Justin [mailto:cry...@yahoo.com]
 Sent: Monday, October 04, 2010 2:15 PM
 To: java-user@lucene.apache.org
 Subject: Re: Updating documents with fields that aren't stored
 
 Even for IndexWriter.getReader (near real-time)? changes made during an
 IndexWriter session can be  quickly made available for searching without
 closing
 the writer nor calling commit(long).
 
 http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexW
 riter.html#getReader()
 
 
 
 
 
 - Original Message 
 From: Steven A Rowe sar...@syr.edu
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Mon, October 4, 2010 1:05:36 PM
 Subject: RE: Updating documents with fields that aren't stored
 
 This is not a defect:
 http://wiki.apache.org/lucene-
 java/LuceneFAQ#Does_Lucene_allow_searching_and_indexing_simultaneously.3F
 .
 
 
  -Original Message-
  From: Justin [mailto:cry...@yahoo.com]
  Sent: Monday, October 04, 2010 2:03 PM
  To: java-user@lucene.apache.org
  Subject: Updating documents with fields that aren't stored
 
  Hi all,
 
  The JavaDocs do not appear to mention that only stored fields persist
  IndexWriter.updateDocument. When opening new readers, from either
  IndexWriter.getReader or IndexReader.open, neither TermDocs nor
  IndexSearcher
  will find terms in fields which weren't stored.
 
  Existing readers, however, do continue to find such terms after
  updateDocument
  has been called. At best, this is confusing. Is this a defect?
 
  Thanks,
  Justin
 
 
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
Hi Iam,

Can you say why you don't like the proposed solution?

Also, the example of the scoring you're looking for doesn't appear to be 
hierarchical in nature - can you illustrate the relationship between the 
tokens in [token1, token2, token3]?  Also, why do you want token1 to contribute 
more to the score than token2?

Steve

 -Original Message-
 From: Iam Jabour [mailto:iamjab...@gmail.com]
 Sent: Wednesday, September 15, 2010 9:20 AM
 To: lucene-group
 Subject: Hierarchical Fields
 
  Hello, any one can help me with fields?
 
 I have the same problem posted in
 http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/luc
 ene-java/HierarchicalFields,
 but I don't like the proposed solutions. I need a order field, like [
 token1, token2, token3]
 If a query match with token1 the score is bigger then a match in
 token2, or same thing like that.
 
 __
 Iam Jabour
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
Summarizing a #lucene conversation I had with Iam (aka PackageLost):

-
Steve: How deep is your hierarchy? I ask because you may be able to have one 
field for each level in the hierarchy, and boost the levels higher the closer 
they are to the root

Iam: Hum, now is ... 5-7.  I think 6

Steve: If you have fields level1, level2, level3, etc., and boost level1 
highest, level2 a little lower, etc., then search against all levels

Iam: But can a document have different fields? There are all fields and plus 
those N levels fields but some documents just need 2 levels.

Steve: Lucene does not require every document to have the same set of fields. 
doc1 could have field A, and nothing else, and doc2 could have field B and 
nothing else, both in the same index. No problem.
-

An example: Iam's third query from below (pop) could be expanded to the 
following QueryParser query (assuming just one content field in addition to 
the levelX fields):

content:pop level1:pop^128 level2:pop^64 level3:pop^32 level4:pop^16 
level5:pop^8 level6:pop^4 level7:pop^2

This would result in doc4, doc5, doc2, which is the desired behavior.
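
Programmatically, the same query could be built like this (untested sketch; the field names and boost values are only illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  String term = "pop";
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("content", term)), Occur.SHOULD);
  float boost = 128f;
  for (int level = 1; level <= 7; ++level) {
    TermQuery levelQuery = new TermQuery(new Term("level" + level, term));
    levelQuery.setBoost(boost);      // level1 gets 128, level2 gets 64, ...
    query.add(levelQuery, Occur.SHOULD);
    boost /= 2;
  }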

Steve


 -Original Message-
 From: Iam Jabour [mailto:iamjab...@gmail.com]
 Sent: Wednesday, September 15, 2010 12:22 PM
 To: java-user@lucene.apache.org
 Subject: Re: Hierarchical Fields
 
 Let's go to some example:
 
 1 - Suppose I have some path tree, like:
  - /music/
      | - rock/
      |     | - doc1 = "artist1 music blues ..."
      |     | - doc2 = "artist2 music pop ..."
      | - blues/
      |     | - doc3 = "artist3 ..."
      |     | - pop/
      |           | - doc5 = "artist1 ..."
      | - pop/
            | - doc4 = "artist1 music rock ..."
 
 2 - I created lucene documents like this example:
  field1 = (path, doc1fullpath)
  field2 = (value, doc1Value)
 and do the same to all documents.
 
 3 - now I going to do the search:
   $ rock
 I get some sort like: [doc4, doc1, doc2]
 but I want: [doc1 | doc2] and the others [doc3  doc4] like doc1, doc2,
 doc4
 
   $ music AND blues
 I get: doc1, doc3
 but I want: doc3, doc1
 
   $ pop
 I want: doc4 then doc5 (because the path to doc4 is smaller then doc5)
 
 So to do this I need:
 1 - change field boost
 2 - set priority of path, and to do that: I create N field (one field
 to node in the path) or have some Lucene feature (but I don't know
 how)
 
 Thanks.
 __
 Iam Jabour
 
 
 
 
 On Wed, Sep 15, 2010 at 12:52 PM, Steven A Rowe sar...@syr.edu wrote:
  Hi Iam,
 
  Can you say why you don't like the proposed solution?
 
  Also, the example of the scoring you're looking for doesn't appear to be
  hierarchical in nature - can you illustrate the relationship between
 the tokens in [token1, token2, token3]?  Also, why do you want token1 to
 contribute more to the score than token2?
 
  Steve
 
  -Original Message-
  From: Iam Jabour [mailto:iamjab...@gmail.com]
  Sent: Wednesday, September 15, 2010 9:20 AM
  To: lucene-group
  Subject: Hierarchical Fields
 
   Hello, any one can help me with fields?
 
  I have the same problem posted in
 
 http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/luc
  ene-java/HierarchicalFields,
  but I don't like the proposed solutions. I need a order field, like [
  token1, token2, token3]
  If a query match with token1 the score is bigger then a match in
  token2, or same thing like that.
 
  __
  Iam Jabour
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Hi Christoph,

There could be several things going on, but it's difficult to tell without more 
information.  

Since excluded terms require a non-empty set from which to remove documents at 
the same boolean clause level, you could try something like "title:(*:* 
-Datei*) avl", or "-title:Datei* avl".

Another possible problem is case.  If you downcase indexed terms, Datei* will 
not match any of them by default, since no analysis is carried out on wildcard 
terms.  QueryParser has a static method setLowercaseExpandedTerms() that you 
can call to turn on automatic pre-expansion query term downcasing:

http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/queryParser/QueryParser.html#setLowercaseExpandedTerms%28boolean%29
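
Putting the two suggestions together (untested sketch; the analyzer and field names are just examples):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
      new StandardAnalyzer(Version.LUCENE_30));
  parser.setLowercaseExpandedTerms(true); // "Datei*" becomes "datei*" before expansion
  Query query = parser.parse("(*:* -title:Datei*) avl");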

Steve

 -Original Message-
 From: Christoph Hermann [mailto:herm...@informatik.uni-freiburg.de]
 Sent: Monday, August 16, 2010 9:32 AM
 To: java-user@lucene.apache.org
 Subject: Search results include results with excluded terms
 
 Hi,
 
 i've built a local index of the german wikipedia (works fine so far).
 
 Now when i'm searching this index with luke (or my own code) using a query
 like title:(-Datei*) avl i still get results with Documents where the
 title contains: Datei:foo.
 
 The title field is created like this:
 Field fieldTitle = new Field(Metadata.TITLE, title, Field.Store.YES, 
 Field.Index.ANALYZED);
 
 Can someone explain to me why i still get these results?
 
 If i click on explain in luke, it tells me that the score basically came
 from the contents field where avl is included.
 
 So the question is, how do i *exclude* documents? I.e. score the exclusion
 very low, so that these results won't appear at all?
 
 regards
 Christoph
 
 --
 Christoph Hermann
 Institut für Informatik
 Tel: +49 761-203-8171 Fax: +49 761-203-8162
 e-mail: herm...@informatik.uni-freiburg.de
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Oops, setLowercaseExpandedTerms() is an instance method, not static.

I wrote:
 QueryParser has a static method setLowercaseExpandedTerms() that you can call
 to turn on automatic pre-expansion query term downcasing:
 
 http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/queryParser/QueryParser.html#setLowercaseExpandedTerms%28boolean%29


RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin,

 [...] *:* AND -myfield:foo*.
 
 If my document contains myfield:foobar and myfield:dog, the document
 would be thrown out because of the first field. I want to keep the
 document because the second field does not match.

I'm assuming that you mistakenly used the same field name above in 
(myfield:foobar and myfield:dog), and that you instead meant:

myfield1:foobar and myfield2:dog.

I think you can get what you want by specifying every field in the query - 
e.g., if each document has the same set of two fields F1 and F2:

(*:* AND -F1:foo*) OR (*:* AND -F2:foo*)

Truth table for four documents:

Doc1: F1:foobar (no-match), F2:dog  (match)= match
Doc2: F1:cat(match),F2:dog  (match)= match
Doc3: F1:cat(match),F2:foosball (no-match) = match
Doc4: F1:foobar (no-match), F2:foosball (no-match) = no-match

Good luck,
Steve



RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin,

 Unfortunately the suffix requires a wildcard as well in our case. There
 are a limited number of prefixes though (10ish), so perhaps we could
 combine them all into one query. We'd still need some sort of
 InverseWildcardQuery implementation.
 
  use another analyzer so you don't need wildcards
 
 I know analyzers can be used with IndexWriter and with QueryParser. Is
 there somewhere an analyzer could be used to alter the field to match the
 query at search time instead of altering the query to match the field?

Can you give an example of what you mean?

 Our current path to solving our problem requires additional fields which
 need rewritten causing a much larger performance degredation. One of the
 two paths above would be much more desirable.

An inverse query would require rewriting, too, I think.

You say you have 10-ish prefixes.  Can you turn those prefixes into field 
names, and index a token like EMPTY when there are no values for a particular 
prefix?  Then your query would be (F1:EMPTY OR F2:EMPTY ... OR F10:EMPTY).

Steve


RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin,

  an example
 
 PerFieldAnalyzerWrapper analyzers =
 new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
 // "myfield" defaults to KeywordAnalyzer
 analyzers.addAnalyzer("content", new SnowballAnalyzer(luceneVersion, "English"));
 // analyzers affects the indexed field value
 IndexWriter writer = new IndexWriter(dir, analyzers, true, mfl);
 // analyzers affects the parsed query string
 QueryParser parser = new QueryParser(luceneVersion, "myfield", analyzers);
 parser.setAllowLeadingWildcard(true);
 Query query = parser.parse("*:* AND -myfield:\"*foo*\"");
 // What about an Analyzer to match field value to the query at search time?
 ScoreDoc[] docs = searcher.search(query, null, 1000).scoreDocs;

I'm afraid that this example doesn't help me - my reading of "What about an 
Analyzer to match field value to the query at search time?" is that you want 
what Lucene already does, but that's clearly not true. 

  An inverse query would require rewriting, too, I think.
 
 Why would implementing a new Query class requires document changes in the
 index.

Query rewriting is what I meant (and what I thought you meant).  What do you 
mean when you say rewriting - how would it affect indexed documents?

  Can you turn those prefixes into field names
 
 No, the prefixes are not discrete. Multiple field values could start with
 the same prefix.

Hmm, again with the misunderstanding on my part - how is it that prefixes are 
not discrete and also there are a limited number of prefixes ... (10ish)?  
And why is it important that multiple field values could start with the same 
prefix?  Why couldn't you just store all of those that share the same prefix in 
the field corresponding to the prefix?

Steve



RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
  you want what Lucene already does, but that's clearly not true
 
 Hmmm, let's pretend that contents field in my example wasn't analyzed at 
 index
 time. The unstemmed form of terms will be indexed. But if I query with a 
 stemmed
 form or use QueryParser with the SnowballAnalyzer, I'm not going to get a 
 match.
 I could fix this situation by analyzing the indexed field at search time to
 match the query. I don't know that Lucene provides this opportunity and, as I
 said, maybe that's crazy.

I don't know of any facility in Lucene to re-analyze indexed content at search 
time.  This is an IR anti-pattern - it defeats the purpose of constructing the 
inverted index.

People generally decide prior to indexing what kind of queries they need to 
support, and then perform all required document analysis at index time.  To use 
your stemming example, if you need to be able to match against both stemmed and 
unstemmed forms, you would make both a stemmed field and an unstemmed field, 
and then construct corresponding sub-queries against each as required.

  What do you mean when you say rewriting
 
 Our current path to solving our problem requires additional fields which need
 rewritten. I meant actually altering the document in the index. My desire has
 been to write a new Query class implementation whereas you mentioned query
 rewriting (isn't that accomplished as in my example by passing an Analyzer to
 QueryParser?)

Changing the index at query time, unless you have a tiny set of documents, 
sounds like a big mistake to me.  Again, extremely expensive.

Query rewriting: 
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Query.html#rewrite%28org.apache.lucene.index.IndexReader%29

  not discrete...  limited number of prefixes
 
 So my document may have myfield:A*foo*, myfield:B*foo*,
 myfield:A*dog*, and myfield:D*cat*.
 
 Or, to phrase differently, myfield:[PREFIX][PATTERN] may appear any
 number of times where PREFIX comes from the set { A, B, C, D, E, ... }.
 
 This complexity is really a tangent of my question in order to avoid poor
 performance from WildcardQuery.

I still think you could make one field for each PREFIX, and then do whatever 
query you want on each of those fields.

If you need to support contains functionality (e.g. A:*foo*), you might 
want to look into ngram analysis at indexing time (and maybe also at query 
time, depending on the source of your queries):

http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/ngram/NGramTokenFilter.html

You would have to know the minimum and maximum length of the contained string 
you would want to query for.
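
A rough (untested) sketch of such an index-time analyzer, with gram sizes chosen only as an example:

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;
  import org.apache.lucene.analysis.ngram.NGramTokenFilter;

  Analyzer ngramAnalyzer = new Analyzer() {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream stream = new WhitespaceTokenizer(reader);
      stream = new LowerCaseFilter(stream);
      return new NGramTokenFilter(stream, 3, 5); // index all 3- to 5-char grams
    }
  };

A "contains foo" query then becomes a simple TermQuery for the gram "foo" against the n-gram field, with no wildcard expansion at all.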

Steve



RE: ShingleFilter failing with more terms than index phrase

2010-07-13 Thread Steven A Rowe
Hi Ethan,

You'll probably get better answers about Solr specific stuff on the 
solr-u...@a.l.o list.

Check out PositionFilterFactory - it may address your issue:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory

Steve

 -Original Message-
 From: Ethan Collins [mailto:collins.eth...@gmail.com]
 Sent: Tuesday, July 13, 2010 3:42 AM
 To: java-user@lucene.apache.org
 Subject: ShingleFilter failing with more terms than index phrase
 
 I am using lucene 2.9.3 (via Solr 1.4.1) on windows and am trying to
 understand ShingleFilter. I wrote the following code and find that if I
 provide more words than the actual phrase indexed in the field, then the
 search on that field fails (no score found with debugQuery=true).
 
 Here is an example to reproduce, with field names:
 Id: 1
 title_1: Nina Simone
 title_2: I put a spell on you
 
 Query (dismax) with:
 - “Nina Simone I put”  - Fails i.e. no score shown from title_1 search
 (using debugQuery)
 - “Nina Simone” - SUCCESS
 
 But, when I used Solr’s Field Analysis with the ‘shingle’ field (given
 below) and tried “Nina Simone I put”, it succeeds. It’s only during the
 query that no score is provided. I also checked ‘parsedquery’ and it shows
 disjunctionMaxQuery issuing the string “Nina_Simone Simone_I I_put” to the
 title_1 field.
 
 title_1 and title_2 fields are of type ‘shingle’, defined as:
 
    <fieldType name="shingle" class="solr.TextField"
        positionIncrementGap="100" indexed="true" stored="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="false"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="false"/>
      </analyzer>
    </fieldType>
 
 Note that I also have a catchall field which is text. I have qf set
 to: 'id^2 catchall' and pf set to: 'title_1^1.5 title_2^1.2'
 
 If I am missing something or doing something wrong please let me know.
 
 -Ethan
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: URL Tokenization

2010-06-24 Thread Steven A Rowe
Hi Sudha,

Sorry, I should have mentioned that the existing patch is intended for use only 
against the trunk version (i.e., version 4.0-dev).

Instructions for checking out a working copy from Subversion are here:

   http://wiki.apache.org/lucene-java/HowToContribute

Once you've done that, change directory to the root directory of the checked 
out working copy and apply the patch, like you did previously.

Steve

 -Original Message-
 From: Sudha Verma [mailto:verma.su...@gmail.com]
 Sent: Thursday, June 24, 2010 12:57 PM
 To: java-user@lucene.apache.org
 Subject: Re: URL Tokenization
 
 Hi Steve,
 
 Thanks for the quick reply and implementing support for URL tokenization.
 Another newbie question about applying this patch.
 
 I have the Lucene 3.0.2 source and I downloaded the patch and tried to
 apply
 it:
 
  lucene-3.0.2> patch -p0 < LUCENE-2167.patch
 
 Comes back with the error message:
 
 (output truncated)
 can't find file to patch at input line 13106 Perhaps you used the wrong -p
 or --strip option?
 The text leading up to this was:
 
 
 After looking at the line, it looks like it's trying to find
 modules/analysis/common/build.xml -- which is not part of the official
 3.0.2 src release. And thinking about it, may be I need to use the latest
 source (or a nightly build). But, I couldn't figure how to get that. The
 hudson link for nightly builds on the apache-lucene site seems to be
 broke. Or may be I have a different problem.
 
 I'd appreciate any help.
 
 Thanks,
 Sudha
 
 
 
 On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe sar...@syr.edu wrote:
 
  Hi Sudha,
 
  There is such a tokenizer, named NewStandardTokenizer, in the most
  recent patch on the following JIRA issue:
 
https://issues.apache.org/jira/browse/LUCENE-2167
 
  It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
  e-mails too, in accordance with the relevant IETF RFCs.
 
  Steve
 
   -Original Message-
   From: Sudha Verma [mailto:verma.su...@gmail.com]
   Sent: Wednesday, June 23, 2010 2:07 PM
   To: java-user@lucene.apache.org
   Subject: URL Tokenization
  
   Hi,
  
   I am new to lucene and I am using Lucene 3.0.2.
  
   I am using Lucene to parse text which may contain URLs. I noticed
   the StandardTokenizer keeps the email addresses in one token, but
   not the URLs.
   I also looked at Solr wiki pages, and even though the wiki page for
   solr.StandardTokenizerFactory says it keeps track of the URL token
   type - it does not seem to be the case.
  
   Is there an Analyzer implementation that can keep the URLs intact
   into
  one
   token? or does anyone have an example of that for Solr or Lucene?
  
   Thanks much,
   Sudha
 


RE: URL Tokenization

2010-06-23 Thread Steven A Rowe
Hi Sudha,

There is such a tokenizer, named NewStandardTokenizer, in the most recent patch 
on the following JIRA issue: 

   https://issues.apache.org/jira/browse/LUCENE-2167

It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and e-mails 
too, in accordance with the relevant IETF RFCs.

Steve

 -Original Message-
 From: Sudha Verma [mailto:verma.su...@gmail.com]
 Sent: Wednesday, June 23, 2010 2:07 PM
 To: java-user@lucene.apache.org
 Subject: URL Tokenization
 
 Hi,
 
 I am new to lucene and I am using Lucene 3.0.2.
 
 I am using Lucene to parse text which may contain URLs. I noticed the
 StandardTokenizer keeps the email addresses in one token, but not the
 URLs.
 I also looked at Solr wiki pages, and even though the wiki page for
 solr.StandardTokenizerFactory says it keeps track of the URL token type -
 it does not seem to be the case.
 
 Is there an Analyzer implementation that can keep the URLs intact into one
 token? or does anyone have an example of that for Solr or Lucene?
 
 Thanks much,
 Sudha


RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Hi Andy,

From the API docs for IndexWriter 
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html:

[D]ocuments are added with addDocument and removed
with deleteDocuments(Term) or deleteDocuments(Query).
A document can be updated with updateDocument (which
just deletes and then adds the entire document).
When finished adding, deleting and updating documents, 
close should be called.

These changes  are not visible to IndexReader
until either commit() or close() is called.

So you gotta call commit() or close().  Once you've done that, you can reduce 
the (expensive) cost of opening a new IndexReader by calling reopen():

http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexReader.html#reopen%28%29
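
In other words, something along these lines (untested sketch):

  writer.addDocument(doc);
  writer.commit();                 // make the change visible to (re)opened readers

  IndexReader newReader = reader.reopen();
  if (newReader != reader) {       // reopen() returns the same instance if nothing changed
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
  }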

Steve

 -Original Message-
 From: andynuss [mailto:andrew_n...@yahoo.com]
 Sent: Monday, June 21, 2010 11:02 AM
 To: java-user@lucene.apache.org
 Subject: search hits not returned until I stop and restart application
 
 
 Hi,
 
 I have an IndexWriter singleton in my program, and an IndexSearcher
 singleton based on a readonly IndexReader singleton.  When I use the
 IndexWriter to index a large document to lucene, and then, while the
 program is still running, use my previously created IndexSearcher to find
 hits in that book, they are not found.  But if I stop and restart the
 application, then they are found.
 
 Andy
 --
 View this message in context: http://lucene.472066.n3.nabble.com/search-
 hits-not-returned-until-I-stop-and-restart-application-
 tp911711p911711.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Andy,

I think batching commits either by time or number of documents is common.

Do you know about NRT (Near Realtime Search)?: 
http://wiki.apache.org/lucene-java/NearRealtimeSearch.  Using 
IndexWriter.getReader(), you can avoid commits altogether, as well as reducing 
update-search latency.  See IndexWriter.getReader() javadocs for more details: 
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29.

Depending on requirements, these two strategies can be combined.

Steve

 -Original Message-
 From: andynuss [mailto:andrew_n...@yahoo.com]
 Sent: Monday, June 21, 2010 2:44 PM
 To: java-user@lucene.apache.org
 Subject: RE: search hits not returned until I stop and restart application
 
 
 Maybe you aren't using the IndexReader instance returned by reopen(), but
 instead are continuing to use the instance on which you called reopen()?
 It's tough to figure this kind of thing out without looking at the code.
 
 That was it, I was not using the newly (re)opened index.  By the way, one
 last question.  It doesn't matter for this because I'm indexing one huge
 document at a time, and then committing.  But later, I will also be
 indexing very small documents frequently.  In that case, it would seem
 that if I index a very small document, I don't want to be thrashing with a
 commit after each one, and then a reopen of the reader and reconstruction
 of my searcher.  Do others manage this type of thing with a thread that
 fires at intervals to commit if dirty?


RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
Hi Siraj,

Lucene's MemoryIndex can be used to serve this purpose.

From 
http://lucene.apache.org/java/3_0_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html:

[T]his class targets fulltext search of huge numbers
of queries over comparatively small transient realtime
data (prospective search). 

MemoryIndex can only hold one document at a time.

See also Lucene's InstantiatedIndex, which can hold more than one document at a 
time:

http://lucene.apache.org/java/3_0_1/api/contrib-instantiated/org/apache/lucene/store/instantiated/InstantiatedIndex.html
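
The basic (untested) MemoryIndex usage pattern looks like the following; the field name and the variables holding the incoming document and the saved queries are hypothetical:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.memory.MemoryIndex;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  MemoryIndex index = new MemoryIndex();
  index.addField("body", incomingDocumentText, new StandardAnalyzer(Version.LUCENE_30));

  for (Query criteria : savedQueries) {      // your pre-parsed criteria
    float score = index.search(criteria);    // 0.0f means "no match"
    if (score > 0.0f) {
      // this criteria matches the incoming document
    }
  }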

Steve

On 05/17/2010 at 4:38 PM, Siraj Haider wrote:
 Hello there,
 In oracle text search there is a feature to reverse search using
 ctxrule.  What it does is, you create an index (ctxrule) on a column
 having your search criteria(s) and then throw a document on it and it
 tells you which criteria(s) it satisfies.  Is there something in Lucene
 that does that or there are any plans to do that?
 
 thanks
 -siraj

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
Hi Siraj,

The usual answer to questions like yours (Will performance of Lucene component 
X against my N records be good enough?) is It depends: on the nature of the 
queries, the nature of the documents, the hardware you run on, etc.  That said, 
if you construct your query objects once and reuse them, it'll likely be 
extremely fast.

Here are some benchmarks (which I found by Googling Lucene MemoryIndex) using 
PyLucene:

http://www.sajalkayan.com/prospective-search-using-python.html

Give it a try!  Lucene is pretty easy to get started with.  Ask questions if 
you run into trouble.

Good luck,
Steve

On 05/17/2010 at 5:46 PM, Siraj Haider wrote:
 Hi Steven,
 Thanks for the quick reply.  I checked the documentation of MemoryIndex
 and it seems like, you have to create an index in memory with one
 document and will have to run the queries against that single document.
 But my dilemma is, I might have upto 100,000 queries to run against it.
 Do you think this route will give me results in reasonable amount of
 time, i.e. in a few seconds?
 
 thanks
 -siraj
 
 On 5/17/2010 5:21 PM, Steven A Rowe wrote:
  Hi Siraj,
  
  Lucene's MemoryIndex can be used to serve this purpose.
  
  Fromhttp://lucene.apache.org/java/3_0_1/api/contrib-
  memory/org/apache/lucene/index/memory/MemoryIndex.html:
  
  [T]his class targets fulltext search of huge numbers
  of queries over comparatively small transient realtime
  data (prospective search).
  
  MemoryIndex can only hold one document at a time.
  
  See also Lucene's InstantiatedIndex, which can hold more than one
  document at a time:
  
  http://lucene.apache.org/java/3_0_1/api/contrib-
 instantiated/org/apache/lucene/store/instantiated/InstantiatedIndex.htm
 l
  
  Steve
  
  On 05/17/2010 at 4:38 PM, Siraj Haider wrote:
  
   Hello there, In oracle text search there is a feature to reverse
   search using ctxrule.  What it does is, you create an index (ctxrule)
   on a column having your search criteria(s) and then throw a document
   on it and it tells you which criteria(s) it satisfies.  Is there
   something in Lucene that does that or there are any plans to do that?
   
   thanks
   -siraj


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: PrefixQuery and special characters

2010-04-14 Thread Steven A Rowe
Hi Franz,

The likely problem is that you're using an index-time analyzer that strips out 
the parentheses.  StandardAnalyzer, for example, does this; WhitespaceAnalyzer 
does not.

Remember that hits are the result of matches between index-analyzed terms and 
query-analyzed terms.  Except in the case of synonyms, most people will want 
their index and query analyzers to be the same.
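
A quick (untested) way to see what a given analyzer actually indexes:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  static void printTokens(Analyzer analyzer, String text) throws Exception {
    TokenStream ts = analyzer.tokenStream("category", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
    ts.close();
  }

With StandardAnalyzer, "(testvalue)" comes out as the single term "testvalue", which is why a PrefixQuery for "(test" can never match it; with WhitespaceAnalyzer the parentheses are kept.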

Steve


From: Franz Roth [franzr...@gmx.de]
Sent: Wednesday, April 14, 2010 7:42 AM
To: java-user@lucene.apache.org
Subject: PrefixQuery and special characters

Hi all,

say I have an index with one field named "category". There are two documents, 
one with value "(testvalue)" and one with value "test value".
Now someone searches with "test". My search engine uses 
org.apache.lucene.search.PrefixQuery and finds 2 documents. Maybe he expected 
only one hit; however, if he searches for "(test" and the search engine uses 
QueryParser.escape to clean the request and takes that PrefixQuery to search, 
nothing results.

How can I search for the document "(testvalue)" and only this one?

Thx!





package foo.bar;

import java.io.IOException;

import junit.framework.TestCase;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;



public class TestPrefixQuery extends TestCase {
public void testEscapeAndPrefix() throws CorruptIndexException,
LockObtainFailedException, IOException {

RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new 
StandardAnalyzer(),
true, IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
doc.add(new Field("category", "(testvalue)", Field.Store.YES,
Field.Index.ANALYZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("category", "test value", Field.Store.YES,
Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

String value = "test";
PrefixQuery query = new PrefixQuery(new Term("category", value));
//log.debug(query.toString());
IndexSearcher searcher = new IndexSearcher(directory);
ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
assertEquals("One for " + value, 2, hits.length); // I want one for this?!

value = "(test";
String escaped = QueryParser.escape(value);
query = new PrefixQuery(new Term("category", escaped));
//log.debug(query.toString());
hits = searcher.search(query, null, 1000).scoreDocs;
assertEquals("One for " + value + "/" + escaped, 1, hits.length); // FAILS!
}
}


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene query with long strings

2010-03-23 Thread Steven A Rowe
Hi Aaron,

Your false positives comments point to a mismatch between what you're 
currently asking Lucene for (any document matching any one of the terms in the 
query) and what you want (only fully correct matches).

You need to identify the terms of the query that MUST match and tell Lucene
about it ("+" syntax is understood by QueryParser to mean a required term).

If your queries come from sources that don't reliably match the index's values,
you may need to use synonyms to map between e.g. "California" and "CA", and
then require that at least one of the synonyms matches (e.g. "+(California
CA)").
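
For illustration, a rough (untested) sketch of the same idea with the
BooleanQuery API -- the field name and terms are just placeholders, and I'm
assuming lowercasing analysis at index time:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// at least one of the synonyms must match
BooleanQuery synonyms = new BooleanQuery();
synonyms.add(new TermQuery(new Term("affiliation", "california")), Occur.SHOULD);
synonyms.add(new TermQuery(new Term("affiliation", "ca")), Occur.SHOULD);

BooleanQuery query = new BooleanQuery();
query.add(synonyms, Occur.MUST);                                              // required synonym group
query.add(new TermQuery(new Term("affiliation", "stanford")), Occur.MUST);    // required term
query.add(new TermQuery(new Term("affiliation", "medicine")), Occur.SHOULD);  // optional term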

Steve

On 03/23/2010 at 5:08 PM, Aaron Schon wrote:
 hi all, I have been playing with Lucene for a while now, but stuck on a
 perplexing issue.
 
 I have an index, with a field Affiliation, some example values are:
 
 - "Stanford University School of Medicine, Palo Alto, CA USA"
 - "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA"
 - "School of Medicine, Harvard University, Boston MA"
 - "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
 - "Harvard University, Cambridge MA"
 
 and so on... (the bottom-line being the affiliations are written in
 multiple ways with no apparent consistency)
 
 I query the index on the affiliation field using, say, "School of
 Medicine, Stanford University, Palo Alto, CA" (with QueryParser) to
 find all Stanford-related documents. I get a lot of false +ves,
 presumably because of the presence of "School of Medicine" etc. etc.
 (note: I cannot use a phrase query because of variability in the way
 affiliation is constructed)
 
 I have tried the following:
 
 1. Use a SpanNearQuery by splitting the search phrase with a whitespace
 (here I get no results!)
 2. Tried boosting (using ^) by splitting with the comma and boosting
 the last parts such as Palo Alto CA with a much higher boost than the
 initial phrases. Here I still get lots of false +ves.
 
 Any suggestions on how to approach this? Is SpanNear the way to go? Any
 other ideas on why I get 0 results?
 
 Thanks in advance for helping a newbie.
 
 AS


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Increase number of available positions?

2010-03-17 Thread Steven A Rowe
Hi Rene,

On 03/17/2010 at 11:17 AM, Rene Hackl-Sommer wrote:
 <SpanNot fieldName="MyField">
   <Include>
     <!-- Gets all the matching spans within L_2 boundaries and includes
          them -->
     <SpanNot>
       <Include>
         <SpanNear slop="2147483647" inOrder="false">
           <SpanTerm>t293</SpanTerm>
           <SpanTerm>t4979</SpanTerm>
         </SpanNear>
       </Include>
       <Exclude>
         <SpanTerm>L_2</SpanTerm>
       </Exclude>
     </SpanNot>
   </Include>
   <Exclude>
     <!-- Gets all the matching spans from L_3 boundaries and excludes them -->
     <SpanNot>
       <Include>
         <SpanNear slop="2147483647" inOrder="false">
           <SpanTerm>t293</SpanTerm>
           <SpanTerm>t4979</SpanTerm>
         </SpanNear>
       </Include>
       <Exclude>
         <SpanTerm>L_3</SpanTerm>
       </Exclude>
     </SpanNot>
   </Exclude>
 </SpanNot>

 Shouldn't this query only leave documents, where t293 and t4979 are in
 the same L_2, but not within the same L_3?

I'm not sure what's wrong with the above (have you tried each of the two nested 
SpanNot clauses independently?), but here's another thing to try:

<SpanNot>
  <Include>
    <SpanOr>
      <SpanNear slop="2147483647" inOrder="true">
        <SpanTerm>t293</SpanTerm>
        <SpanTerm>L_3</SpanTerm>
        <SpanTerm>t4979</SpanTerm>
      </SpanNear>
      <SpanNear slop="2147483647" inOrder="true">
        <SpanTerm>t4979</SpanTerm>
        <SpanTerm>L_3</SpanTerm>
        <SpanTerm>t293</SpanTerm>
      </SpanNear>
    </SpanOr>
  </Include>
  <Exclude>
    <SpanTerm>L_2</SpanTerm>
  </Exclude>
</SpanNot>
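
If it's easier to experiment with programmatically, here is an untested Java
sketch of the same query (field name "MyField" as in your example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.*;

String f = "MyField";
// t293 ... L_3 ... t4979 (in order), i.e. an L_3 boundary lies between them
SpanQuery t293then4979 = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(f, "t293")),
    new SpanTermQuery(new Term(f, "L_3")),
    new SpanTermQuery(new Term(f, "t4979")) }, Integer.MAX_VALUE, true);
// same thing with the two terms in the opposite order
SpanQuery t4979then293 = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(f, "t4979")),
    new SpanTermQuery(new Term(f, "L_3")),
    new SpanTermQuery(new Term(f, "t293")) }, Integer.MAX_VALUE, true);
// either order is fine, but the span must not cross an L_2 boundary
SpanQuery query = new SpanNotQuery(
    new SpanOrQuery(new SpanQuery[] { t293then4979, t4979then293 }),
    new SpanTermQuery(new Term(f, "L_2")));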

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene,

Why can't you use a different field for each of the Level_X's, i.e. 
MyLevel1Field, MyLevel2Field, MyLevel3Field?

On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote:
   Search in MyField: Terms T1 and T2 on Level_2 and T3,
   T4, and T5 on  Level_3, which should both be in the
   same Level_1.

I don't understand what you mean by "which should both be in the same Level_1".
Can you give more details?

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene,

Have you seen SpanNotQuery?: 

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html

For a document that looks like:

<Level_1 id="1">
  <Level_2 id="1">
    <Level_3 id="1">T1 T2 T3</Level_3>
    <Level_3 id="2">T4 T5 T6</Level_3>
    <Level_3 id="3">T7 T8 T9</Level_3>
  </Level_2>
  <Level_2 id="2">
    <Level_3 id="1">T10 T11 T12</Level_3>
    <Level_3 id="2">T13 T14 T15</Level_3>
    <Level_3 id="3">T16 T17 T18</Level_3>
  </Level_2>
  ...
</Level_1>
...

You could generate the following token stream (L_X being a concrete level 
boundary token):

L_1 L_2 L_3 T1  T2  T3  L_3 T4  T5  T6  L_3 T7  T8  T9
L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
L_2 ...
...

A query to find T2 and T8 on the same Level_2 would require you to find a span 
containing T2 and T8, but not containing L_2.

This scheme will generalize to as many levels as you need, and you can use 
nested span queries to simultaneously provide constraints at multiple levels.  
No position increment gap required.

Caveat: this scheme is not tested - I could be way off base :).
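
For example (untested, and assuming the single field is called "MyField"), the
"T2 and T8 on the same Level_2" query might look like:

String f = "MyField";
// a span containing both T2 and T8, in any order, any distance apart...
SpanQuery t2AndT8 = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(f, "T2")),
    new SpanTermQuery(new Term(f, "T8")) }, Integer.MAX_VALUE, false);
// ...as long as that span does not contain an L_2 boundary token
SpanQuery sameLevel2 = new SpanNotQuery(t2AndT8,
    new SpanTermQuery(new Term(f, "L_2")));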

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Searching Subversion comments:

2010-03-08 Thread Steven A Rowe
Hi Erick,

On 03/08/2010 at 3:48 PM, Erick Erickson wrote:
 Is there any convenient way to, say, find all the files associated with
 patch ? I realize one can (hopefully) get this information from
 JIRA, but... This is a subset of the problem of searching Subversion
 comments.

I know of two commercial implementations (probably not what you're after):

1. Atlassian provides SVN access to open source software projects via hosted 
FishEye instances.  Here's their ASF instance:

  http://fisheye6.atlassian.com/

Mahout has apparently set this up already, and I assume Lucene-java could do 
the same:

  http://fisheye6.atlassian.com/browse/mahout

You can query comments and just about anything else -- see the Query tab.

2. IntelliJ IDEA's Repository tab in the Changes pane provides a VC system 
browser (Subversion among I don't know how many others) with a commit message 
search box -- from a revision hit from a commit message search, you can see a 
tree view of modified files, and then double click on each of them to see 
side-by-side colored differences.  Extremely slick.

Also, in the open source realm:

3. ViewVC (formerly ViewCVS) has a facility to query revision history, 
including commit messages.  Apache's instance, which serves Lucene's 
repository, doesn't expose this functionality, though.

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Reverse Search

2010-03-01 Thread Steven A Rowe
Hi Mark,

On 03/01/2010 at 3:35 PM, Mark Ferguson wrote:
 I will be processing short bits of text (Tweets for example), and
  need to search them to see if they contain certain terms.

You might consider, instead of performing reverse search, just querying all of 
your locations against one document at a time using Lucene's MemoryIndex, which 
is very fast: 

http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html
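
Untested sketch of that approach (the field name, analyzer, and the
tweetText/locationQueries variables are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
MemoryIndex index = new MemoryIndex();
index.addField("text", tweetText, analyzer);    // one tweet at a time
for (Query locationQuery : locationQueries) {   // your pre-built location queries
  if (index.getScore(locationQuery) > 0.0f) {
    // this tweet matches this location
  }
}
index.reset();                                  // reuse for the next tweet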

If you decide to go the reverse search route, Lucene's InstantiatedIndex is 
also very fast, and unlike MemoryIndex, can handle more than one document at a 
time:

http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/store/instantiated/package-summary.html

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Match span of capitalized words

2010-02-05 Thread Steven A Rowe
Hi Max,

On 02/05/2010 at 10:18 AM, Grant Ingersoll wrote:
 On Feb 3, 2010, at 8:57 PM, Max Lynch wrote:
  Hi, I would like to do a search for "Microsoft Windows" as a span, but
  not match if words before or after "Microsoft Windows" are upper cased.
  
  For example, I want this to match: "another crash for Microsoft Windows
  today" But not this: "another crash for Microsoft Windows Server today"
  
  Is this possible?  My first attempt started with the SpanRegexQuery
  from the regex contrib package, but I can't figure out how to put in a
  term I do want to match but don't want to include in the final
  highlighting match. Does that make sense?
  
  My example (using WhitespaceAnalyzer since I care about case):
  
  SpanRegexQuery srq1 = new SpanRegexQuery(new Term("contents", "Chase"));
  SpanRegexQuery srq2 = new SpanRegexQuery(new Term("contents", "Bank[\\.]*"));
  SpanRegexQuery srq3 = new SpanRegexQuery(new Term("contents", "[^A-Z]*"));
 
 I'm not sure it supports it, but I wonder if you could use a negative
 lookahead assertion?  Most regex languages support it.

I don't think this would work, since the input to a SpanRegexQuery regex is a 
single Term; following Terms are not included in the input.

I *think* you can get what you want using SpanNotQuery - something like the 
following, using your Microsoft Windows example:

SpanNot:
  include:
    SpanNear(in-order=true, slop=0):
      SpanTerm: Microsoft
      SpanTerm: Windows
  exclude:
    SpanNear(in-order=true, slop=0):
      SpanTerm: Microsoft
      SpanTerm: Windows
      SpanRegex: ^\\p{Lu}.*
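
In code that would be roughly (untested; SpanRegexQuery lives in the regex
contrib module, and I'm using your "contents" field):

String f = "contents";
SpanQuery microsoftWindows = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(f, "Microsoft")),
    new SpanTermQuery(new Term(f, "Windows")) }, 0, true);
SpanQuery followedByCapitalizedWord = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(f, "Microsoft")),
    new SpanTermQuery(new Term(f, "Windows")),
    new SpanRegexQuery(new Term(f, "^\\p{Lu}.*")) }, 0, true);
// match "Microsoft Windows" except where a capitalized word follows
SpanQuery query = new SpanNotQuery(microsoftWindows, followedByCapitalizedWord);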

Steve
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Unexpected Query Results

2010-02-04 Thread Steven A Rowe
Hi Jamie,

Since phrase query terms aren't analyzed, you're getting exact matches for 
terms было and время, but when you search for them individually, they are 
analyzed, and it is the analyzed query terms that fail to match against the 
indexed terms.  Sounds to me like your index-time and query-time analyzers are 
not the same.  Maybe you have a stemming filter for query-time, but not for 
index-time?

Steve

On 02/04/2010 at 2:39 AM, Jamie wrote:
 Hi
 
 I have some unexpected query results.
 
 When attempting two queries:
 
 1) All fields, exact phrase query returns 48 hits
 
 (priority:"было время" attach:"было время" score:"было время" size:"было время"
 sentdate:"было время" archivedate:"было время" receiveddate:"было время"
 from:"было время" to:"было время" subject:"было время" cc:"было время"
 bcc:"было время" deliveredto:"было время" flag:"было время"
 sensitivity:"было время" sender:"было время" recipient:"было время"
 body:"было время" attachments:"было время" attachname:"было время"
 memberof:"было время")
 
 2) All fields, All words query return 0 hits
 
 ((priority:было attach:было score:было size:было sentdate:было
 archivedate:было receiveddate:было from:было to:было subject:было
 cc:было bcc:было deliveredto:было flag:было sensitivity:было sender:было
 recipient:было body:было attachments:было attachname:было memberof:было
 ) AND (priority:время attach:время score:время size:время sentdate:время
 archivedate:время receiveddate:время from:время to:время subject:время
 cc:время bcc:время deliveredto:время flag:время sensitivity:время
 sender:время recipient:время body:время attachments:время
 attachname:время memberof:время ))
 
 I am not sure why query 2 returns 0 hits. In my mind it should return 48
 hits as in query (1).
 
 I am using Lucene 3.0. Is there a pseudo field for all search terms?
 
 Thanks
 
 Jamie
 
 
 
 
 




RE: Analyzer for stripping non alpha-numeric characters?

2010-02-04 Thread Steven A Rowe
Hi Jason,

Solr's PatternReplaceFilter(ts, "\\P{Alnum}+$", "", false) should work, chained 
after an appropriate tokenizer.
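
Untested sketch (note that PatternReplaceFilter is in Solr's
org.apache.solr.analysis package and takes a compiled Pattern; the
WhitespaceTokenizer is just an example choice):

import java.io.Reader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.PatternReplaceFilter;

Analyzer analyzer = new Analyzer() {
  private final Pattern trailingNonAlnum = Pattern.compile("\\P{Alnum}+$");
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // strip non-alphanumeric characters from the end of each token
    return new PatternReplaceFilter(ts, trailingNonAlnum, "", false);
  }
};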

Steve

On 02/04/2010 at 12:18 PM, Jason Rutherglen wrote:
 Is there an analyzer that easily strips non alpha-numeric from the end
 of a token?
 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Unexpected Query Results

2010-02-04 Thread Steven A Rowe
On 02/04/2010 at 3:24 PM, Chris Hostetter wrote:
 : Since phrase query terms aren't analyzed, you're getting exact
 : matches
 
 quoted phrases passed to the QueryParser are analyzed -- but they are
 analyzed as complete strings, so Analyzers that treat whitespace
 specially may produce different Terms than if the individual words
 were analyzed individually (which is what happens when QueryParser
 is given multiple words that aren't in a quoted phrase)

Yikes, you're right, of course (I just checked the code) - I was thinking of 
contrib/misc/AnalyzingQueryParser, which adds analysis to fuzzy, prefix, range, 
and wildcard queries, since *those* are not analyzed by QueryParser, and had 
added phrases to that list in my model of reality...



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: combine query score with external score

2010-01-28 Thread Steven A Rowe
Hi Dennis,

You should check out payloads (arbitrary per-index-term byte[] arrays), which 
can be used to encode values which are then incorporated into documents' 
scores, by overriding Similarity.scorePayload():

http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/Similarity.html#scorePayload%28int,%20java.lang.String,%20int,%20int,%20byte[],%20int,%20int%29

The Lucene in Action 2 MEAP has a nice introduction to using payloads to 
influence scoring, in section 6.5.

See also this (slightly out-of-date*) blog post Getting Started with Payloads 
by Grant Ingersoll at Lucid Imagination:

http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

*Note that since this blog post was written, BoostingTermQuery was renamed to 
PayloadTermQuery (in Lucene 2.9.0+ ; see 
http://issues.apache.org/jira/browse/LUCENE-1827 ; wow - this issue isn't 
mentioned in CHANGES.txt???):

http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/payloads/PayloadTermQuery.html
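
A bare-bones, untested sketch of the scoring side (it assumes you have already
written a float payload on each "cat" term at index time, e.g. with
DelimitedPayloadTokenFilter and a FloatEncoder):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

class PayloadSimilarity extends DefaultSimilarity {
  @Override
  public float scorePayload(int docId, String fieldName, int start, int end,
                            byte[] payload, int offset, int length) {
    if (payload == null || length < 4) {
      return 1.0f;                                     // no payload: neutral factor
    }
    return PayloadHelper.decodeFloat(payload, offset); // e.g. the category confidence
  }
}

// at query time, on an already-open IndexSearcher named "searcher":
searcher.setSimilarity(new PayloadSimilarity());
PayloadTermQuery catQuery =
    new PayloadTermQuery(new Term("cat", "girls"), new AveragePayloadFunction());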

Steve

On 01/28/2010 at 6:01 AM, Dennis Hendriksen wrote:
 I'm struggling to create a performant query in Lucene 3.0.0 in which I
 want to combine 'regular' scoring with scores derived from external
 sources.
 
 For each document a fixed set of scores is calculated in the range [0.0,
 1.0]. These scores represent the confidences that a document falls into
 categories. So for example document #1 has a score of 0.3 for cat=boys,
 0.2 for cat=girls, 0.1 for cat=toys, 0.05 for cat=animals.
 
 The 'regular' scoring is calculated using a BooleanQuery with TermQuerys
 similar to: -type:H +(title:dna body:dna^1.5)
 
 In the current naive approach I'm combining the scores as following: -
 for each document store the three best categories in the following
 fields:
 name=cat1st value=boys fieldboost=0.3
 name=cat2nd value=girls fieldboost=0.2
 name=cat3rd value=toys fieldboost=0.1
 Search-time use the following query if you're interested in 'girls':
 -type:H +(title:dna body:dna^1.5) cat1st:girls cat2nd:girls cat3rd:girls 
 or if you're interested in 'boys': 
 -type:H +(title:dna body:dna^1.5) cat1st:boys cat2nd:boys cat3rd:boys
 
 Disadvantages of the current approach:
 - loss of precision encoding/decoding boosts (performance is important,
 so this might be acceptable)
 - using TermQuery for the cat fields doesn't make a lot of sense since
 the external scores are multiplied by the idf of 'boys'/'girls' and
 the querynorm
 - the resulting score from the cat field is added to the other query
 score instead of multiplied
 
 Just to give you an idea: the index I'm using is growing in time and
 contains about 50 million documents
 
 Do you have an idea how I can improve my query and still keep high
 performance? Or should I combine the scores in the Collector (but this
 doesn't seem the right place to retrieve the category scores from the
 index)? Is it possible to use a different float-byte encoder per field
 to reduce the lack of precision?
 
 Thanks for your time,
 Dennis




RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba,

The problem is that Lucene only knows how to handle character strings, not
numbers.  Lexicographically, "3" > "10", so you get the results you're seeing
(nothing).

The standard thing to do is transform your numbers into strings that sort as
you want them to.  E.g., you can left-pad the rank field values with zeroes:
"03", "04", ..., "10", and then create a RangeFilter over "03" .. "10".  You
will of course need to left-zero-pad to at least the maximum character length
of the largest rank.

Facilities to handle this problem are available in NumberTools:

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/NumberTools.html

(Note that NumberTools converts longs to base-36 fixed-length padded strings.)
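
Untested sketch against the 2.4 API, using NumberTools for both indexing and
filtering:

// index time: store the rank as a fixed-width, lexicographically sortable string
Document doc = new Document();
doc.add(new Field("rank", NumberTools.longToString(3L),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));

// search time: build the filter with the same encoding
RangeFilter rangeFilter = new RangeFilter("rank",
    NumberTools.longToString(3L), NumberTools.longToString(10L), true, true);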

More info here:

   http://wiki.apache.org/lucene-java/SearchNumericalFields

Steve

On 01/13/2010 at 12:51 PM, AlexElba wrote:
 
 Hello,
 
 I am currently using lucene 2.4 and have document with 3 fields
 
 id
 name
 rank
 
 and have query and filter when I am trying to use rang filter on rank I
 am not getting any result back
 
 RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true);
 
 I have documents which are in this interval
 
 
 Any suggestion what am I doing wrong?
 
 Regards

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba,

Did you completely re-index?

If you did, then there is some other problem - can you share (more of) your 
code?

Do you know about Luke?  It's an essential tool for Lucene index debugging:

   http://www.getopt.org/luke/

Steve

On 01/13/2010 at 8:34 PM, AlexElba wrote:
 
 Hello,
 
 I change filter to follow
  RangeFilter rangeFilter = new RangeFilter("rank",
      NumberTools.longToString(rating),
      NumberTools.longToString(10), true, true);
 
 and change index to store rank the same way... But still not seeing :(
 any results
 
 
 AlexElba wrote:
  
  Hello,
  
  I am currently using lucene 2.4 and have document with 3 fields
  
  id
  name
  rank
  
  and have query and filter when I am trying to use rang filter on rank I
  am not getting any result back
  
   RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true,
   true);
  
  I have documents which are in this interval
  
  
  Any suggestion what am I doing wrong?
  
  Regards
  
  
  
  
  
 
 -- View this message in context: http://old.nabble.com/RangeFilter-
 tp27148785p27155102.html Sent from the Lucene - Java Users mailing list
 archive at Nabble.com.
 
 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: TopFieldDocCollector and v3.0.0

2009-12-08 Thread Steven A Rowe
Hi Uwe,

On 12/08/2009 at 9:40 AM, Uwe Schindler wrote:
 After the move to 3.0, you can (but you must not) further update
 your code to use generics, which is not really needed but will
 remove all compiler warnings.

This sounds like you're telling people that although they are able to update 
their code to use generics, it is forbidden.

I'm sure, though, that you mean that they are not required to do so: something 
like but you need not rather than but you must not.

Steve

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Hits and TopDoc

2009-10-20 Thread Steven A Rowe
Hi Nathan,

On 10/20/2009 at 5:03 PM, Nathan Howard wrote:
 This is sort of related to the above question, but I'm trying to update
 some (now depricated) Java/Lucene code that I've become aware of once we
 started using 2.4.1 (we were previously using 2.3.2):
 
 Hits results = multiSearcher.search(query);
 
 int start = currentPage * resultsPerPage;
 int stop = (currentPage + 1) * resultsPerPage;
 
 for (int x = start; (x < results.length()) && (x < stop); x++)
 {
     Document doc = results.doc(x);
     // do search post-processing with the Document
 }
 
 Results per page is normally small (10ish or so).
 
 I'm having difficulty figuring out how to get TopDocs to replicate this
 paging functionality (which the application must maintain).

From 
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Hits.html:
=
Deprecated. Hits will be removed in Lucene 3.0.

Instead e. g. TopDocCollector and TopDocs can be used:

   TopDocCollector collector = new TopDocCollector(hitsPerPage);
   searcher.search(query, collector);
   ScoreDoc[] hits = collector.topDocs().scoreDocs;
   for (int i = 0; i < hits.length; i++) {
 int docId = hits[i].doc;
 Document d = searcher.doc(docId);
 // do something with current hit
 ...
=

Construct the TopDocCollector with your stop variable instead of 
hitsPerPage, initialize the loop control variable with the value of your 
start variable instead of 0, and you should be good to go.
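
In other words, something like this (untested; searcher, query, currentPage and
resultsPerPage as in your snippet):

int start = currentPage * resultsPerPage;
int stop  = (currentPage + 1) * resultsPerPage;

TopDocCollector collector = new TopDocCollector(stop);  // collect through the end of this page
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int x = start; x < hits.length && x < stop; x++) {
  Document doc = searcher.doc(hits[x].doc);
  // do search post-processing with the Document
}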

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Hits and TopDoc

2009-10-20 Thread Steven A Rowe
Hi Yonik,

Hmm, in what version of Hits do you see this updated javadoc?  In the 2.9.0 
version, the only change in the Hits javadoc from the 2.4.1 version in this 
section is that it refers to TopScoreDocCollector instead of TopDocCollector:

http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Hits.html

And, of course, Hits has now been removed from trunk as part of the deprecation 
cleansing ritual.

Steve

On 10/20/2009 at 5:43 PM, Yonik Seeley wrote:
 Hmm, yes, I should have thought of quoting the javadoc :-)
 The Hits javadoc has been updated though... we shouldn't be pushing
 people toward collectors unless they really need them:
 
  *   TopDocs topDocs = searcher.search(query, numHits);
  *   ScoreDoc[] hits = topDocs.scoreDocs;
 *   for (int i = 0; i < hits.length; i++) {
  * int docId = hits[i].doc;
  * Document d = searcher.doc(docId);
  * // do something with current hit
 
 
 -Yonik
 http://www.lucidimagination.com
 
 
 
 On Tue, Oct 20, 2009 at 5:27 PM, Steven A Rowe sar...@syr.edu wrote:
  Hi Nathan,
 
  On 10/20/2009 at 5:03 PM, Nathan Howard wrote:
  This is sort of related to the above question, but I'm trying to
 update
  some (now depricated) Java/Lucene code that I've become aware of
 once we
  started using 2.4.1 (we were previously using 2.3.2):
 
  Hits results = MultiSearcher.search(Query));
 
  int start = currentPage * resultsPerPage;
  int stop = (currentPage + 1) * resultsPerPage();
 
   for(int x = start; (x < searchResults.length()) && (x < stop); x++)
  {
      Document doc = searchResults.doc(x);
      // do search post-processing with the Document
  }
 
  Results per page is normally small (10ish or so).
 
  I'm having difficulty figuring out how to get TopDocs to replicate
 this
  paging functionality (which the application must maintain).
 
  From
 http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/
 Hits.html:
  =
  Deprecated. Hits will be removed in Lucene 3.0.
 
  Instead e. g. TopDocCollector and TopDocs can be used:
 
    TopDocCollector collector = new TopDocCollector(hitsPerPage);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
     for (int i = 0; i < hits.length; i++) {
      int docId = hits[i].doc;
      Document d = searcher.doc(docId);
      // do something with current hit
      ...
  =
 
  Construct the TopDocCollector with your stop variable instead of
 hitsPerPage, initialize the loop control variable with the value of
 your start variable instead of 0, and you should be good to go.
 
  Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


