Re: Twitter analyser

2013-11-08 Thread Lance Norskog
This is a parts-of-speech analyzer for tweets. It would make your index 
far more useful.


http://www.ark.cs.cmu.edu/TweetNLP/
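One way to get the exact matching behaviour Stéphane asks about below is to inject the bare word at the same position as each mention/hashtag token at index time. A minimal sketch, assuming a whitespace-tokenized stream; the class is my own illustration, not a stock Lucene filter:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// For a token like "@foo" or "#foo", also emit "foo" at the same position,
// so the query foo matches all three forms while @foo/#foo stay exact.
public final class BareWordInjectFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State pending;   // captured state of the @/# token
  private String bareWord; // bare word still to be emitted

  public BareWordInjectFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (bareWord != null) {
      restoreState(pending);
      termAtt.setEmpty().append(bareWord);
      posIncrAtt.setPositionIncrement(0); // stack on the previous token
      bareWord = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > 1) {
      char first = termAtt.charAt(0);
      if (first == '@' || first == '#') {
        bareWord = termAtt.subSequence(1, termAtt.length()).toString();
        pending = captureState();
      }
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    bareWord = null;
  }
}

With this in the index-time chain and plain whitespace tokenization at query time, foo matches the word, the mention and the hashtag, while @foo and #foo remain exact.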

On 11/04/2013 11:40 PM, Stéphane Nicoll wrote:

Hi,

I am building an application that indexes tweets and offers some basic
search facilities on them.

I am trying to find a combination where the following would work:

* foo matches the foo word, a mention (@foo) or the hashtag (#foo)
* @foo only matches the mention
* #foo matches only the hashtag

It should match complete words, so I used the WhitespaceAnalyzer for indexing.

Any recommendation for this use case?

Thanks !
S.

Sent from my iPhone






Re: JLemmaGen project

2013-11-04 Thread Lance Norskog
This is very cool! Lemmatization is an important tool for making search 
work better.


Would you consider changing the licensing to the Apache 2.0 license?

On 10/23/2013 08:17 AM, Michal Hlavac wrote:

Hi,

I rewrote the lemmatizer project LemmaGen (http://lemmatise.ijs.si/) in Java. 
Originally it is written in C#.
The Lemmagen project uses rules to lemmatize words. The algorithm is described here:
http://lemmatise.ijs.si/Download/File/Documentation%23JournalPaper.pdf

The project is released under the GPLv3. Sources are located on Bitbucket:
https://bitbucket.org/hlavki/jlemmagen

There is also the Lemmagen4j project, which uses more memory and comes without 
prebuilt trees.

I also obtained licensed dictionaries to build rule trees for 15 languages. 
The dictionaries are licensed, but the prebuilt trees are not.
You can also build your own dictionary.

The project also contains a TokenFilter for Lucene/Solr.
The project is not yet stable, but any feedback is appreciated.

Supported languages are:
mlteast-bg - Bulgarian
mlteast-cs - Czech
mlteast-en - English
mlteast-et - Estonian
mlteast-fr - French
mlteast-hu - Hungarian
mlteast-mk - Macedonian
mlteast-pl - Polish
mlteast-ro - Romanian
mlteast-ru - Russian
mlteast-sk - Slovak
mlteast-sl - Slovene
mlteast-sr - Serbian
mlteast-uk - Ukrainian

thanks, miso





Re: posting list strings

2013-07-14 Thread Lance Norskog
Is there a Trie-based term index? Seems like this would be smaller, and 
very fast on non-leading wildcards.


On 07/09/2013 02:34 PM, Uwe Schindler wrote:

Hi,

You can replace the terms by their hashes directly in the analyzer chain. Just 
write a custom TermToBytesRefAttribute that hashes the term to a 
constant-length byte[] (using an AttributeFactory)! :-) This would give you all 
the features of hashed, constant-length terms, but you would lose prefix and 
wildcard queries. In fact, NumericTokenStream does this for numerics!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
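For illustration only, the same idea can be approximated one level up, with a TokenFilter that replaces each term by a fixed-length hash. This is not the TermToBytesRefAttribute approach Uwe describes (which avoids the term string entirely), just a sketch of the trade-off:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Replaces each term with 16 hex chars (first 8 bytes of its MD5 hash).
// Terms become constant-length, but prefix/wildcard queries become useless.
public final class HashingTermFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final MessageDigest digest;

  public HashingTermFilter(TokenStream input) {
    super(input);
    try {
      digest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is always present in the JDK
    }
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    digest.reset();
    byte[] hash = digest.digest(termAtt.toString().getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder(16);
    for (int i = 0; i < 8; i++) {
      hex.append(String.format("%02x", hash[i] & 0xff));
    }
    termAtt.setEmpty().append(hex);
    return true;
  }
}

Note that MD5 is not a perfect hash, so two distinct terms can in principle collide; a production version would have to account for that.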



-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, July 09, 2013 11:25 PM
To: java-user@lucene.apache.org
Subject: Re: posting list strings

Hi,

Lucene stores the string because it may need it to run prefix or range
queries. We don't have a hash-based terms dictionary right now but I know
some people wrote one since they don't need support for these queries, see
for instance the Earlybird paper[1]. Then if you can find a perfect hashing
function, you can just replace your terms by their hash.

[1]
http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

--
Adrien








Re: In memory index (current status in Lucene)

2013-07-01 Thread Lance Norskog
My current open source project is a Directory that is just like 
RAMDirectory, but everything is memory-mapped. The idea is that it creates a 
disk file, opens it, and immediately deletes the file. The file still 
exists until the IndexReader/Writer/Searcher closes it, but it cannot 
be found from the file system. This is just like a RAMDirectory, but 
without the memory limitations.


It's proving to be harder than it looked.

The application is to store encrypted indexes in memory, with the 
decrypted contents in this non-findable format. I'm in medical document 
analysis now, and we can't store anything on disk in the clear.


Lance
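In outline, the delete-after-open trick looks like this. A sketch assuming POSIX semantics, where an unlinked file's storage stays alive while a handle or mapping is open:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AnonymousMappedFile {
  // Create a temp file, unlink it, and map it: the storage stays usable
  // but is no longer visible in the file system.
  public static MappedByteBuffer map(long size) throws IOException {
    Path path = Files.createTempFile("hidden-index", ".bin");
    FileChannel channel = FileChannel.open(path,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
    Files.delete(path); // unlink; the open channel keeps the inode alive
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
    channel.close();    // the mapping itself keeps the pages reachable
    return buffer;
  }
}

The hard part, of course, is building all of a Directory's IndexInput/IndexOutput plumbing on top of such buffers.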

On 07/01/2013 07:07 AM, Emmanuel Espina wrote:

Hi Erick! Nice to hear from you again! From time to time my interest
in these Lucene things returns and I do some experiments :p

Just to add to this conversation, I found an interesting link to
Mike's blog about memory resident indexes (using another virtual
machine) 
http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html
and also (which is not exactly what I asked but seems related) there
is a Google Summer of Code project to build a memory-resident term
dictionary:
http://www.google-melange.com/gsoc/project/google/gsoc2013/billybob/42001

Thanks
Emmanuel


2013/7/1 Erick Erickson erickerick...@gmail.com:

Hey Emma! It's been a while

Building on what Steven said, here's Uwe's blog on
MMapDirectory and Lucene:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I've always considered RAMDirectory for rather restricted
use-cases. I.e. if I know without doubt that the index
is both relatively static and bounded. The other use I've
seen is to use it to index single documents on-the-fly for
some reason (say complex processing of a single result)
then throw it out afterwards.

How are things going?

Erick



On Fri, Jun 28, 2013 at 5:36 PM, Steven Schlansker ste...@likeness.comwrote:


On Jun 28, 2013, at 2:29 PM, Emmanuel Espina espinaemman...@gmail.com
wrote:


I'm building a distributed index (mostly as a research project for
school) and I'm evaluating indexing the entire collection in memory
(like google, facebook and others have done years ago). The obvious
reason for this is performance considering that the replication will
give me a reasonably good durability of the data (despite being in
volatile memory).

What is the current status of Lucene for this kind of index?
RAMDirectory's documentation has a scary warning that says it
is not intended to work with huge indexes, and that sounds more like
an implementation for testing rather than something for
production.

Of course there is no real context for this question, because it is a
research topic. Testing its limits would be the closest to a context
I have :p

You could consider MMapDirectory, which will end up putting the active
portions
of the index in memory (via the filesystem buffer cache).

The benefit is that you don't completely destroy the Java heap
(RAMDirectory causes immense
GC pressure if you are not careful) and you don't have to commit all of
your ram to index usage all the time.

The downside is that if your working set exceeds the amount of RAM
available for buffer cache, you will get silent performance degradation as
you fall back to disk reads for the missing blocks.

Maybe this is OK for your use case, maybe not.
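For reference, opening an existing index this way is short (a 4.x-era sketch; the path is an example):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class OpenMMapIndex {
  public static void main(String[] args) throws Exception {
    // The OS page cache, not the JVM heap, holds the hot index blocks.
    Directory dir = new MMapDirectory(new File(args[0]));
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // ... run queries with the searcher
  }
}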









Re: Content based recommender using lucene/solr

2013-06-29 Thread Lance Norskog

Solr/Lucene has two features for this:
1) the MoreLikeThis code, and
2) the clustering project in solr/contrib.

Lance
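A minimal MoreLikeThis sketch (4.x package layout; the field names and tuning values are examples, not recommendations):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class SimilarItems {
  // Find the n items most similar to the already-indexed document docId.
  public static TopDocs similar(IndexReader reader, int docId, int n) throws Exception {
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"title", "description"}); // example fields
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(2);
    Query query = mlt.like(docId); // query built from the doc's own terms
    return new IndexSearcher(reader).search(query, n);
  }
}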

On 06/28/2013 11:15 AM, Luis Carlos Guerrero Covo wrote:

I only have about a million docs right now so scaling is not a big issue.
I'm looking to provide a quick implementation and then worry about scale
when I get around to implementing a more robust recommender. I'm looking at
a content based approach because we are not tracking users and items viewed
by users. I was thinking of using MoreLikeThis like Walter mentioned, but
wanted some feedback on the nuances required for a proper implementation
like having a similarity based on euclidean distance, normalizing numerical
field values and computing collection wide stats like mean and variance.
Thank you for the link Otis, I will watch it right away.


On Fri, Jun 28, 2013 at 1:12 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:


Hi,

It doesn't have to be one or the other.  In the past I've built a news
recommender engine based on CF (Mahout) and combined it with a Content
Similarity-based engine (it wasn't Solr/Lucene, but something custom that
worked with ngrams, but it might as well have been Lucene/Solr/ES).  It
worked well.  If you haven't worked with Mahout before I'd suggest the
approach in that video and going from there to Mahout only if it's
limiting.

See Ted's stuff on this topic, too:
http://www.slideshare.net/tdunning/search-as-recommendation +
http://berlinbuzzwords.de/sessions/multi-modal-recommendation-algorithms
(note: Mahout, Solr, Pig)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Jun 28, 2013 at 2:07 PM, Saikat Kanjilal sxk1...@hotmail.com
wrote:

You could build a custom recommender in Mahout to accomplish this. Also,
just out of curiosity, why the content based approach as opposed to building
a recommender based on co-occurrence?  One other thing: what is your data
size, are you looking at scale where you need something like Hadoop?

From: lcguerreroc...@gmail.com
Date: Fri, 28 Jun 2013 13:02:00 -0500
Subject: Re: Content based recommender using lucene/solr
To: solr-u...@lucene.apache.org
CC: java-user@lucene.apache.org

Hey Saikat, thanks for your suggestion. I've looked into Mahout and other
alternatives for computing k nearest neighbors. I would have to run a job
to compute the k nearest neighbors and track them in the index for
retrieval. I wanted to see if this was something I could do with Lucene,
using Lucene's scoring function and Solr's MoreLikeThis component. The job
you specifically mention is for item-based recommendation, which would
require me to track the different items users have viewed. I'm looking for
a content-based approach where I would use a distance measure to establish
how near (how similar) items are, and have some kind of training phase to
adjust weights.


On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com

wrote:

Why not just use Mahout to do this? There is an item similarity algorithm
in Mahout that does exactly this :)

https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

You can use Mahout in distributed and non-distributed mode as well.


From: lcguerreroc...@gmail.com
Date: Fri, 28 Jun 2013 12:16:57 -0500
Subject: Content based recommender using lucene/solr
To: solr-u...@lucene.apache.org; java-user@lucene.apache.org

Hi,

I'm using Lucene and Solr right now in a production environment with an
index of about a million docs. I'm working on a recommender that basically
would list the n most similar items to the user based on the current item
he is viewing.

I've been thinking of using Solr/Lucene since I already have all docs
available and I want a quick version that can be deployed while we work on
a more robust recommender. How about overriding the default similarity so
that it scores documents based on the Euclidean distance of normalized item
attributes, and then using a MoreLikeThis component to pass in the
attributes of the item for which I want to generate recommendations? I know
it has its issues, like recomputing scores/normalization/weight application
at query time, which could make this idea unfeasible/impractical. I'm at a
very preliminary stage right now with this and would love some suggestions
from experienced users.

thank you,

Luis Guerrero





--
Luis Carlos Guerrero Covo
M.S. Computer Engineering
(57) 3183542047









Please add me as a wiki editor

2013-06-09 Thread Lance Norskog

I'm responsible for the OpenNLP wiki page:
https://wiki.apache.org/solr/OpenNLP

Please add me to the list of editors.


Re: Taking backup of a Lucene index

2013-06-05 Thread Lance Norskog
The simple answer (that somehow nobody gave) is that you can make a copy 
of an index directory at any time. Indexes are changed in generations. 
The segment* files describe the current generation of files. All active 
indexing goes on in new files. In a commit, all new files are flushed to 
disk and then the segment* files change. At any point in this sequence, 
all of the files in the directory form one consistent index.


This isn't like MySQL or other databases where you have to shut down the 
DB to get a safe copy of the files.


Lance
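If you additionally want Lucene itself to pin the files of one commit while you copy them, SnapshotDeletionPolicy is the usual tool. A sketch in the 4.x-style API (3.x differs slightly); the IndexWriter must have been opened with this policy in its config:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SnapshotDeletionPolicy;

public class IndexBackup {
  // Pin a commit point, copy its files, then release it so normal
  // merge/delete housekeeping can resume.
  public static void backup(SnapshotDeletionPolicy sdp,
                            File indexDir, File backupDir) throws Exception {
    IndexCommit commit = sdp.snapshot(); // these files won't be deleted now
    try {
      for (String name : commit.getFileNames()) {
        Path src = new File(indexDir, name).toPath();
        Path dst = new File(backupDir, name).toPath();
        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
      }
    } finally {
      sdp.release(commit);
    }
  }
}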

On 04/17/2013 03:57 AM, Ashish Sarna wrote:

I want to take back-up of a Lucene index. I need to ensure that index files
would not change when I take their backup.

  


I am concerned about the housekeeping/merge/optimization activities which
Lucene performs internally. I am not sure when/how these activities are
performed by Lucene and how we can prevent them.

  


My application (which allows indexing and searching over the created
indexes) keeps running in the background. I can ensure that nothing is
written to the indexes by my application when I take their backup, but I am
not sure whether indexes would change in some manner when a search is
performed over it.

  


How can I ensure that an index would not change (i.e., no
housekeeping/merge/optimization activity is performed by Lucene) when I take
its backup?

  


Any help would be much appreciated.

  


PS: Currently I am using Lucene 2.9.4 but wish to upgrade it to 3.6.2.

  


Regards

Ashish








Re: Zero-position query?

2013-06-03 Thread Lance Norskog

Thanks! Now, to hunt for this in the parsers.

On 06/02/2013 09:16 PM, Israel Tsadok wrote:

You can do this with a PhraseQuery[1]. Just add more terms with position 0.

[1] http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/PhraseQuery.html#add(org.apache.lucene.index.Term, int)
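Concretely, a sketch (field and terms are examples):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SamePositionQuery {
  // Matches only documents where both words sit at the SAME token position,
  // e.g. an index-time synonym stacked on the base word.
  public static PhraseQuery build(String field, String word, String synonym) {
    PhraseQuery query = new PhraseQuery();
    query.add(new Term(field, word), 0);    // relative position 0
    query.add(new Term(field, synonym), 0); // the same relative position
    return query;
  }
}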


On Mon, Jun 3, 2013 at 6:46 AM, Lance Norskog goks...@gmail.com wrote:


What is a Lucene query that will find two words at the same term position?
Is there a class that will do this? Is the feature available from the
Lucene query syntax or any other syntax parsers?

For example, if I'm using synonyms at index time I should get the base
word and all synonyms at the same position. What is a query that will find
a document with the synonym substituted, but will not find a document which
has the base word and a synonym at two different positions?

Thanks,

Lance.









Zero-position query?

2013-06-02 Thread Lance Norskog
What is a Lucene query that will find two words at the same term 
position? Is there a class that will do this? Is the feature available 
from the Lucene query syntax or any other syntax parsers?


For example, if I'm using synonyms at index time I should get the base 
word and all synonyms at the same position. What is a query that will 
find a document with the synonym substituted, but will not find a 
document which has the base word and a synonym at two different positions?


Thanks,

Lance.




Re: StandardAnalyzer: Support for Japanese

2013-01-14 Thread Lance Norskog
3.x and 4.0 Solr releases have nice analyzers just for Japanese. In 4.0 
they are in the Kuromoji package.


In 4.0, the JapaneseAnalyzer probably does what you need:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-kuromoji/4.0.0/org/apache/lucene/analysis/ja/JapaneseAnalyzer.java?av=f

3.6 also has the Kuromoji package, but I don't know how advanced it is 
compared to the 4.x version.


Cheers!
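A sketch of the 4.0-era usage (lucene-analyzers-kuromoji module):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.util.Version;

public class JapaneseAnalyzerFactory {
  // Kuromoji does morphological segmentation, so Japanese text is split
  // into real words rather than raw character runs.
  public static Analyzer create() {
    return new JapaneseAnalyzer(Version.LUCENE_40);
  }
}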

On 01/10/2013 11:19 AM, saisantoshi wrote:

We are using StandardAnalyzer for indexing some Japanese keywords. It works
fine so far, but I just wanted to confirm whether StandardAnalyzer can fully
support it (I have read somewhere in the Lucene in Action book that
StandardAnalyzer does support CJK). Just want to confirm that my understanding
is correct, or do we need to use a specific analyzer for processing Japanese
keywords?

Alternatively, is there a stop words list for Japanese Language so that we
can add an extra filter to the Standard Analyzer.

Any thoughts on this is much appreciated.

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/StandardAnalyzer-Support-for-Japanese-tp4032290.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.






Re: potential memory leak when using RAMDirectory ,CloseableThreadLocal and a thread pool .

2013-01-02 Thread Lance Norskog
There were memory leak problems with earlier versions of Java. You 
should upgrade to Java 6u30.


Lance

On 01/02/2013 05:26 AM, Alon Muchnick wrote:

Hello All ,
we are using Lucene 3.6.2 in our web application on Tomcat 5.5 and recently
we started testing our application on Tomcat 7. Unfortunately we seem to
encounter a memory leak in Lucene's CloseableThreadLocal class; any help
with solving the below issue would be much appreciated.


we are using RAMDirectory for our indexes. While testing the application
on Tomcat 7 we noticed that there is a memory leak in our application;
after taking a heap dump we can see the memory leak is in the following
reference chain:



Type | Name            | Value
ref  | index           | org.apache.lucene.store.RAMDirectory
ref  | core            | org.apache.lucene.index.SegmentReader$CoreReaders
ref  | tis             | org.apache.lucene.index.TermInfosReader
ref  | threadResources | org.apache.lucene.util.CloseableThreadLocal
ref  | hardRefs        | java.util.HashMap @ 0x9d566938


I guess the HashMap is used for caching purposes and it holds entries where
the key is a thread name and the value is a
org.apache.lucene.index.TermInfosReader$ThreadResources object.

Even when I stop new incoming connections to the application, Tomcat
closes all the active threads and a GC is run, the above map size is not
reduced and GC cannot reclaim the heap space.
The problem looks somewhat similar to LUCENE-3841
https://issues.apache.org/jira/browse/LUCENE-3841
but we are not using SnowballAnalyzer.

(I checked the code and made sure the hardRefs map is a WeakHashMap)


our JVM is:

OpenJDK 64-Bit Server VM,
java.runtime.version 1.6.0_20-b20


Once again, any help would be much appreciated.

thanks

Alon







Re: Pulling lucene 4.1

2013-01-02 Thread Lance Norskog
4.x does not promise backwards compatibility with 3.x. Have you made 
your own extensions?


On 01/02/2013 04:38 AM, Shai Erera wrote:

There's no specific branch for 4.1 yet. All development still happens on
the 4x branch (
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/).

Note that Lucene maintains two active branches for development: 'trunk'
(currently to be 5.0) and '4x' off of which all Lucene 4.x releases are
created.

Shai


On Wed, Jan 2, 2013 at 11:57 AM, Ramprakash Ramamoorthy 
youngestachie...@gmail.com wrote:


Dear all,

Would be glad to know on which branch of Lucene the development
of version 4.1 is happening. Would be glad if you can share the
repo URL; we are testing out certain features of 4.1, including
CompressingStoredFieldsFormat.

Currently we are pulling from trunk, which I guess is the 5.x
branch. Very particular about 4.1 because we need backward compatibility
with 3.x. Thanks in advance.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.







Re: Which token filter can combine 2 terms into 1?

2012-12-28 Thread Lance Norskog
How do you choose t2 and t2a? If you have a full inventory of these 
pairs, you can turn them into multi-word synonyms and use the SynonymFilter 
to combine them.
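If the inventory is known, a sketch with the 4.x SynonymFilter, using the t2/t2a example from the quoted message below (includeOrig=false drops the original pair):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class CombineTermsExample {
  // Rewrites the two-token sequence "t2 t2a" as the single term "t2t2a".
  public static TokenStream build(String text) throws Exception {
    SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup
    CharsRef input = SynonymMap.Builder.join(new String[] {"t2", "t2a"}, new CharsRef());
    builder.add(input, new CharsRef("t2t2a"), false); // false: drop originals
    SynonymMap map = builder.build();
    Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_40, new StringReader(text));
    return new SynonymFilter(tokenizer, map, true); // true: ignore case
  }
}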


On 12/20/2012 11:50 PM, Xi Shen wrote:

Hi,

I am looking for a token filter that can combine 2 terms into 1? E.g.

the input has been tokenized by white space:

t1 t2 t2a t3

I want a filter that output:

t1 t2t2a t3

I know it is a very special case, and I am thinking about developing a filter
of my own. But I cannot figure out which API I should use to look at terms
in a TokenStream.







Re: how to implement a TokenFilter?

2012-12-26 Thread Lance Norskog

Go to the top directory and do this:
cp dev-tools/eclipse/dot.project .project
cp dev-tools/eclipse/dot.classpath .classpath
cp -r dev-tools/eclipse/dot.settings .settings

The 'ant eclipse' target does this setup.

On 12/24/2012 10:45 PM, Xi Shen wrote:

Hi Lance,

I got the lucene 4 source from
http://mirror.bjtu.edu.cn/apache/lucene/java/4.0.0/lucene-4.0.0-src.tgz; it
is an Ant project. But I do not know which IDE can import it... I tried Eclipse;
it cannot import the build.xml file.


Thanks,
D.


On Mon, Dec 24, 2012 at 12:02 PM, Lance Norskog goks...@gmail.com wrote:


You need to use an IDE. Find the Attribute type and show all subclasses.
This shows a lot of rare ones and a few which are used a lot. Now, look at
source code for various TokenFilters and search for other uses of the
Attributes you find. This generally is how I figured it out.

Also, after the full Analyzer stack is called, the caller saves the output
(I guess to codecs?). You can look at which Attributes it saves.


On 12/23/2012 06:30 PM, Xi Shen wrote:


thanks a lot :)


On Mon, Dec 24, 2012 at 10:22 AM, feng lu amuseme...@gmail.com wrote:

  hi Shen

Maybe you can see some source code in the org.apache.lucene.analysis package,
such as LowerCaseFilter.java, StopFilter.java and so on.

and some common attribute includes:

offsetAtt = addAttribute(OffsetAttribute.class);
termAtt = addAttribute(CharTermAttribute.class);
typeAtt = addAttribute(TypeAttribute.class);

Regards


On Sun, Dec 23, 2012 at 4:01 PM, Rafał Kuć r@solr.pl wrote:

  Hello!

The simplest way is to look at Lucene javadoc and see what
implementations of Attribute interface there are -

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/Attribute.html


--
Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

thanks, I read this already. It is useful, but it is too 'small'...

e.g. for this.charTermAttr = addAttribute(CharTermAttribute.class);
I want to know what other attributes I need in order to implement
my function. Where can I find a reference to these attributes? I tried
the Lucene & Solr wikis, but all I found is a list of the names of these
attributes, nothing about what they are capable of...




  On Sat, Dec 22, 2012 at 10:37 PM, Rafał Kuć r@solr.pl wrote:

Hello!

A small example with some explanation can be found here:
http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/

--
Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

Hi,

I need a guide to implement my own TokenFilter. I checked the wiki,
but I could not find any useful guide :(






--
Don't Grow Old, Grow Up... :-)















Re: how to implement a TokenFilter?

2012-12-23 Thread Lance Norskog
You need to use an IDE. Find the Attribute type and show all subclasses. 
This shows a lot of rare ones and a few which are used a lot. Now, look 
at source code for various TokenFilters and search for other uses of the 
Attributes you find. This generally is how I figured it out.


Also, after the full Analyzer stack is called, the caller saves the 
output (I guess to codecs?). You can look at which Attributes it saves.
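As a starting point, the usual skeleton is small: declare the Attributes you need as fields, then read or write them inside incrementToken(). A sketch; the 'short term' logic is only a placeholder:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class ShortTermTypeFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public ShortTermTypeFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream
    }
    if (termAtt.length() <= 3) {
      typeAtt.setType("short"); // annotate the token; don't change its text
    }
    return true;
  }
}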


On 12/23/2012 06:30 PM, Xi Shen wrote:

thanks a lot :)


On Mon, Dec 24, 2012 at 10:22 AM, feng lu amuseme...@gmail.com wrote:


hi Shen

Maybe you can see some source code in the org.apache.lucene.analysis package,
such as LowerCaseFilter.java, StopFilter.java and so on.

and some common attribute includes:

offsetAtt = addAttribute(OffsetAttribute.class);
termAtt = addAttribute(CharTermAttribute.class);
typeAtt = addAttribute(TypeAttribute.class);

Regards


On Sun, Dec 23, 2012 at 4:01 PM, Rafał Kuć r@solr.pl wrote:


Hello!

The simplest way is to look at Lucene javadoc and see what
implementations of Attribute interface there are -


http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/Attribute.html

--
Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch


thanks, I read this already. It is useful, but it is too 'small'...
e.g. for this.charTermAttr = addAttribute(CharTermAttribute.class);
I want to know what other attributes I need in order to implement
my function. Where can I find a reference to these attributes? I tried
the Lucene & Solr wikis, but all I found is a list of the names of these
attributes, nothing about what they are capable of...





On Sat, Dec 22, 2012 at 10:37 PM, Rafał Kuć r@solr.pl wrote:

Hello!

A small example with some explanation can be found here:
http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/

--
Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch


Hi,
I need a guide to implement my own TokenFilter. I checked the wiki,
but I could not find any useful guide :(












--
Don't Grow Old, Grow Up... :-)










Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog

Parts-of-speech is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does 
parts-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an 
Apache project for natural-language processing.


Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899

On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.

stephen







Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog
I should not have added that note. The OpenNLP patch gives a concrete 
example of adding an annotation to text.


On 12/13/2012 01:54 PM, Glen Newton wrote:

It is not clear this is exactly what is needed/being discussed.

 From the issue:
We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position.

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:

Parts-of-speech is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does
parts-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache
project for natural-language processing.

Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899


On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

For example:
   - part of speech of a token.
   - syntactic parse subtree (over a span).
   - semantically normalized phrase (to canonical text or ontological
code).
   - semantic group (of a span).
   - coreference link.

stephen












Re: Which stemmer?

2012-11-16 Thread Lance Norskog
Nope! This slang term only exists in the plural. The kind of prose with this 
usage may not follow standard grammatical and spelling rules anyway. 
Historically, text search has been funded mostly by the US intelligence 
agencies because they want to analyze formal and technical prose. And it is 
coded by people who think in good grammar and are perfect spellers.

If you find 'too aggressive' and 'too mild' to be a problem, what you want is 
'lemmatization', where you work from a dictionary of word forms. Solr supports 
using WordNet for this purpose.

Lance

- Original Message -
| From: Igal @ getRailo.org i...@getrailo.org
| To: java-user@lucene.apache.org
| Sent: Friday, November 16, 2012 4:18:20 PM
| Subject: Re: Which stemmer?
| 
| but if dogs are feet (and I guess I fall into the not-perfect group
| here)...  and feet is the plural form of foot, then shouldn't
| dogs
| be stemmed to dog as a base, singular form?
| 
| 
| 
| On 11/16/2012 2:32 PM, Tom Burton-West wrote:
|  Hi Mike,
| 
|  Honestly I've never heard of anyone using dogs to mean feet
|  either, but
|  hey nobody's perfect.
| 
|  This is really off topic but I couldn't resist.  This usage of
|  dogs to
|  mean feet occurs in old blues lyrics such as Blind Lemon
|  Jefferson's Hot
|  Dogs
|  http://www.youtube.com/watch?v=v670qVwzm9c
|  (Hard to make out what he's singing on the old 78, but he's says
|  his dogs
|  is red hot, meaning he can run really fast.)
|  http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/
| 
|  Tom
| 
| 
| 
| 
| 




Re: Lucene 4.0 delete by ID

2012-10-28 Thread Lance Norskog
Scott, did you mean the Lucene integer id, or the unique id field?

- Original Message -
| From: Martijn v Groningen martijn.v.gronin...@gmail.com
| To: java-user@lucene.apache.org
| Sent: Sunday, October 28, 2012 2:24:29 PM
| Subject: Re: Lucene 4.0 delete by ID
| 
| A top-level document ID can change over time. For that reason you
| shouldn't rely on it. However, if you know your index is stable or you
| keep track of when a merge happens, you can use the
| IndexWriter#tryDeleteDocument method to delete a document by Lucene
| id. Deleting a document via an IndexReader is no longer possible.
| 
| Martijn
| 
| On 27 October 2012 01:47, Mossaab Bagdouri
| bagdouri_moss...@yahoo.fr wrote:
|  Lucene document IDs are not stable. You could add a field with an
|  ID that
|  you maintain. Your query would then be just a TermQuery on the ID.
| 
|  Regards,
|  Mossaab
| 
| 
|  2012/10/26 Scott Smith ssm...@mainstreamdata.com
| 
|  I'm currently converting some lucene code to 4.0.  It appears that
|  you are
|  no longer allowed to delete a document by its ID.  Is that
|  correct?  Is my
|  only option to figure some kind of query (which obviously isn't
|  based on
|  ID) and do the delete from there?
| 
| 
| 
| 
| --
| Met vriendelijke groet,
| 
| Martijn van Groningen
| 
| 
| 
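For the stable approach Mossaab describes, the delete is a single call (the field name is an example):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeleteById {
  // Keep your own unique ID in a field and delete with a Term on it.
  public static void delete(IndexWriter writer, String id) throws Exception {
    writer.deleteDocuments(new Term("id", id));
  }
}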




Re: A large number of files in an index (3.6)

2012-10-28 Thread Lance Norskog
An option: instead of merging continuously as you run, you can optimize with 
'maxSegments=10'. This means 'optimize, but only until there are 10 segments'. If 
there are fewer than 10 segments, nothing happens. This lets you schedule 
merging I/O.

Is the number of files a problem due to file space breakage?
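In code, that is a single call on the writer (3.6 API, where forceMerge replaced optimize):

import org.apache.lucene.index.IndexWriter;

public class ScheduledMerge {
  // Merge down to at most 10 segments. If there are already 10 or fewer,
  // nothing happens, so this is safe to run on a schedule.
  public static void mergeDown(IndexWriter writer) throws Exception {
    writer.forceMerge(10);
  }
}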

- Original Message -
| From: kiwi clive kiwi_cl...@yahoo.com
| To: java-user@lucene.apache.org
| Sent: Saturday, October 27, 2012 12:44:34 PM
| Subject: A large number of files in an index (3.6)
| 
| Hi guys,
| 
| I've recently moved from lucene 2.3 to 3.6. The application uses CF
| format. With lucene 2.3, I understood the interaction of merge
| factor etc with respect to how many files were created in the index
| directory. With a merge factor of 10, the number of files in the
| index directory could sometimes get up to 30, but you can see the
| merging happen and the number of files would roll up after a while
| and settle around 10-15.
| 
| 
| With lucene 3.6, this is not the case. Firstly, even with MergePolicy
| set to useCFS, the index appears to be a hybrid of cfs and raw index
| format. I can understand that may have been done for performance
| reasons, but it does increase the file count considerably. Also the
| rollup of the merged segments is not occurring as it did on the
| previous version. Originally I set the CFSRatio to 1.0 and found
| the behaviour similar to lucene2.3 (file number wise) but this came
| at an i/o cost and the machines ran with a higher load average. The
| higher i/o starts to affect query performance. Reducing cfsRatio to
| 0.1 (default) helped reduce i/o load, but I am running several
| thousand concurrent indexes across many disks on the servers and
| the larger number of files per index means a large number of files
| are being opened when a query hits the index, in addition to the
| indexing load.
| 
| I'm sure this is probably down to Merge policies and schedules, but
| there are quite a few knobs to tweak here so some guidance as to the
| the most beneficial parameters to tweak would be very helpful.
| 
| I'm using the LogByteSizeMergePolicy with 3 background merge threads.
| I'm considering using TieredMergePolicy and even reducing the number
| of merge threads, but there is not much point if it does not roll up
| the segments as expected. I can tweak with the cfsRatio but this
| strikes me a large hammer and there may be more subtle ways to do
| this !
| 
| So tell me I'm being stupid, just say 'derr- why dont you do
| this' and I'll be a happy man!!
| 
| Thanks,
| Clive




Re: Efficient string lookup using Lucene

2012-08-26 Thread Lance Norskog
The WhitespaceAnalyzer breaks up text by spaces, tabs and newlines.
After that, you can use wildcards. This will use very little space. I
believe leading/trailing wildcards are supported now, right?

On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin izavo...@caci.com wrote:
 The user uploads a set of text files, either all of them at once or one at a 
 time, and then they will be searched locally on the phone against a set of 
 hotlist words. This assumes no connection to any sort of server so 
 everything must be done locally.

 I already have Lucene integrated so I might want to try the n-gram approach. 
 But I just want to double-check first that it will work with any Unicode 
 string, be it an English word, a foreign word, a sequence of digits or any 
 random sequence of Unicode characters. In other words, this is not in any way 
 language-dependent/-specific.

 Thanks,

 Ilya

 -Original Message-
 From: Dawid Weiss [mailto:dawid.we...@gmail.com]
 Sent: Sunday, August 26, 2012 3:55 AM
 To: java-user@lucene.apache.org
 Subject: Re: Efficient string lookup using Lucene

 Does Lucene support this type of structure, or do I need to somehow 
 implement it outside Lucene?

 You'd have to implement it separately but it'd be much, much smaller than 
 Lucene itself (even obfuscated).

 By the way, I need this to run on an Android phone so size of memory might 
 be an issue...

 How large is your input? Do you need to index on the android or just read the 
 index on it? These are all factors to take into account. I mentioned suffix 
 trees and suffix arrays because these two are canonical data structures to 
 perform any substring lookups in constant time (in fact, the lookup takes time 
 proportional to the length of the matched input string; building the suffix 
 tree/array is O(n), at least in theory).

 If you already have Lucene integrated in your pipeline then that n-gram 
 approach will also work. If you know your minimum match substring length to 
 be p then index p-sized shingles. For strings longer than p you can create a 
 query which will search for all n-gram occurrences and take into account 
 positional information to remove false matches.

 Dawid
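
A sketch of the index side of that n-gram approach (3.x API; p=3 is an example). Queries then become phrase queries over the probe string's 3-grams, with positions filtering out false matches:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

// Index fixed-size character trigrams; works for any Unicode string,
// with no language-specific assumptions.
public class TrigramAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new NGramTokenizer(reader, 3, 3);
  }
}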





-- 
Lance Norskog
goks...@gmail.com




Re: easy way to figure out most common tokens?

2012-08-19 Thread Lance Norskog
You don't need to index the data. Just run the analyzer and maintain
your own counters. This will be disk-bound and will run at your disk
reading speed.
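A sketch of that counting loop (3.6-era API; the field name is an example):

import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenCounter {
  // Run the analyzer over one document's text and bump per-term counters;
  // no index is ever built.
  public static void count(Analyzer analyzer, Map<String, Integer> counts,
                           String text) throws IOException {
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String term = termAtt.toString();
      Integer c = counts.get(term);
      counts.put(term, c == null ? 1 : c + 1);
    }
    ts.end();
    ts.close();
  }
}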

On Sun, Aug 19, 2012 at 5:17 PM, Shaya Potter spot...@gmail.com wrote:
 On 08/19/2012 08:07 PM, Shaya Potter wrote:

 On 08/15/2012 02:34 PM, Ahmet Arslan wrote:

 Is there an easy way to figure out
 the most common tokens and then remove those tokens from the
 documents.


 Probably this :

 http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html


 unsure how to use this

 as far as I can tell org.apache.lucene.misc.TermStats doesn't exist in
 lucene 3.6.1 (there seems to be some class like that in 4.x, but that
 doesn't help me).


 I'm wrong, its there, but eclipse isn't seeing it (haven't tried javac by
 itself), even though it sees HighFreqTerms just fine.






-- 
Lance Norskog
goks...@gmail.com






Re: RAM or SSD...

2012-07-19 Thread Lance Norskog
 You do not want to store 30 G of data in the JVM heap, no matter what library 
 does this.
MMapDirectory does not store data in the JVM heap. It lets the
operating system manage the disk buffer space. Even if the JVM says 'I
have 30G of memory space', it really does not. It only has address
space allocated by the OS, but no memory.

On Wed, Jul 18, 2012 at 10:39 PM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 On Wed, 2012-07-18 at 17:50 +0200, Dragon Fly wrote:
 If I want to improve performance, which of the following is better and why?

 1. Buy a machine with a lot of RAM and use a RAMDirectory for the index.

 As others has pointed out, MMapDirectory should work better than
 RAMDirectory. I am sure it will work fine with a relative small index
 such as yours. However, it does not scale that well with index size.

 2. Put the index on a solid state drive.

 Why anyone buys computers without SSD's is a mystery to me. Use SSDs for
 the small low-latency stuff and a secondary spinning drive for the large
 slow stuff. Nowadays, a 30GB index (or 100GB for that matter) falls into
 the small low-latency bucket. SSDs speeds up almost everything, saves
 RAM and spares a lot of work hours optimizing I/O-speed.

 Regards,
 Toke Eskildsen






-- 
Lance Norskog
goks...@gmail.com




Re: Direct memory footprint of NIOFSDirectory

2012-07-12 Thread Lance Norskog
You can choose another directory implementation.

On Thu, Jul 12, 2012 at 1:42 PM, Vitaly Funstein vfunst...@gmail.com wrote:
 Just thought I'd bump this. To clarify - for reasons outside my
 control, I can't just run the JVM hosting Lucene-enabled application
 with -XX:MaxDirectMemorySize=100G or some other huge value for the
 ceiling and never worry about this. Due to preallocation and other
 restrictions, this parameter has to be fairly close to the actual size
 used by the app (padded for Lucene and possibly other consumers).

 On Mon, Jul 9, 2012 at 7:59 PM, Vitaly Funstein vfunst...@gmail.com wrote:

 Hello,

 I have recently run into the situation when there was not a sufficient 
 amount of direct memory available for IndexWriter to work. This was 
 essentially caused by the embedding application making heavy use of JVM's 
 direct memory buffers and not leaving enough headroom for NIOFSDirectory to 
 operate. So what are the approximate guidelines, if any, in terms of JVM 
 configuration for this choice of Directory to operate safely? Basically, 
 what I am looking for is a rough estimate of direct memory usage per GB of 
 indexed data, or per directory/writer instance, if applicable.

 Thanks,
 -V





-- 
Lance Norskog
goks...@gmail.com




Re: RAMDirectory with FSDirectory merging Versus large mergeFactor and RAMBufferSizeMB

2012-06-05 Thread Lance Norskog
RAMDirectory is no longer an interesting technique for this. It makes
garbage collection do a lot of work. With a memory-mapped directory the
data is cached by the OS instead of Java, and the OS is very good at this.

TieredMergePolicy is much smarter about time spent merging segments.

Lucene in Action 2 might be more help than a 6-year-old book :)
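A sketch of the modern setup (3.6 API; the sizes are examples to tune, not recommendations):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class WriterSetup {
  // A large RAM buffer plus TieredMergePolicy covers most of what the
  // RAMDirectory-buffer trick used to buy.
  public static IndexWriterConfig config(Analyzer analyzer) {
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    cfg.setRAMBufferSizeMB(256); // flush less often
    cfg.setMergePolicy(new TieredMergePolicy());
    return cfg;
  }
}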

On Mon, Jun 4, 2012 at 12:47 AM, Maxim Terletsky sx...@yahoo.com wrote:
 Hi guys,
 There are two approaches I see in Lucene In Action about speeding up the 
 indexing process.

 1) Simply increase the mergeFactor and RAMBufferSizeMB.
 2) Using RAMDirectory as a buffer (perhaps even several in parallel) and 
 later merging it using addIndexes to FSDirectory.

 So my question is the following:
 In case I have only 1 thread with RAMDirectory - is that pretty much the same 
 as method 1? Since it's in memory anyhow for large mergeFactor and large 
 RAMBufferSizeMB.

 Maxim


 



-- 
Lance Norskog
goks...@gmail.com




Re: lucene (search) performance tuning

2012-05-28 Thread Lance Norskog
Can you use filter queries? Filters short-circuit a lot of search
processing. City:San Francisco is a classic filter - it matches a small
part of the documents and it is reused a lot.
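A sketch of the filter approach (3.x/4.x API; field, value and result count are examples):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class FilteredSearch {
  // Cache the reusable city restriction; scoring then runs only over
  // documents that pass the filter.
  static final Filter CITY_FILTER = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("city", "san francisco"))));

  public static TopDocs search(IndexSearcher searcher, Query query) throws Exception {
    return searcher.search(query, CITY_FILTER, 10);
  }
}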

On Sat, May 26, 2012 at 7:32 AM, Yang tedd...@gmail.com wrote:
I'm using a disjunction (OR) query. Unfortunately all of the clauses are
optional.

 On Sat, May 26, 2012 at 4:38 AM, Simon Willnauer 
 simon.willna...@googlemail.com wrote:

 On Sat, May 26, 2012 at 2:59 AM, Yang tedd...@gmail.com wrote:
  I tested with more threads / processes. indeed this is completely
  cpu-bound, since running 1 thread gives the same latency as 4 threads (my
  box has 4 cores)
 
 
  given this, is there any way to simplify the scoring computation (i'm
 only
  using lucene as a first level rough search, so the search quality is
 not
  a huge issue here) , so that, for example, fewer fields are evaluated or
 a
  simpler scoring function is used?

 are you using disjunction or conjunction queries? Can you make some
 parts of the query mandatory?

 simon
 
  thanks
  Yang
 
  On Fri, May 25, 2012 at 5:47 PM, Yang tedd...@gmail.com wrote:
 
  thanks a lot guys
 
 
  On Tue, May 22, 2012 at 1:34 AM, Ian Lea ian@gmail.com wrote:
 
  Lots of good tips in
  http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, linked from
  the FAQ.
 
 
  --
  Ian.
 
 
  On Tue, May 22, 2012 at 2:08 AM, Li Li fancye...@gmail.com wrote:
   something wrong when writing in my android client.
   if RAMDirectory do not help, i think the bottleneck is cpu. you may
 try
  to
   tune jvm but i do not expect much improvement.
   the best one is splitting your index into 2 or more smaller ones.
   you can then use solr s distributed searching.
   if the cpu is not fully used, yuo can do this in one physical machine
  
   On 2012-5-22 8:50 AM, Li Li fancye...@gmail.com wrote:
  
  
    On 2012-5-22 4:59 AM, Yang tedd...@gmail.com wrote:
  
   
I'm trying to make my search faster. right now a query like
   
name:Joe Moe Pizza   address:77 main street  city:San Francisco
   is this a conjunction query or a disjunction query?
  
in a index with 20mil such short business descriptions (total size
   about 3GB) takes about 100--200ms.
   20m is not a small size, how many results for a query in average?
  
I profiled the query, most time is spent in TermScorer.score(),
 as is
   shown by the attached yourkit screenshot.
   that's true, for a query, matching and scoring is very time
 consuming
   and cpu intensive. another one is io for reading postings.
  
   
   
   
I tried loading the index onto tmpfs (in-memory block device), and
  also
   tried RAMDirectory, neither helps much.
   if that is true. it seems that io is not the
I am reading
   http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf
it mentions
Size
– Stopword removal
– Stemming
• Lucene has a number of stemmers available
• Light versus Aggressive
• May prevent fine-grained matches in some cases
– Not a linear factor (usually) due to index compression
   
so for stopword removal, I'm already using the standard
 analyzer,
  so
   stop word removal is already included, right?
   
also generally any other tricks to try for reducing the search
  latency?
   
Thanks!
Yang
   
   
   
 
 
 






-- 
Lance Norskog
goks...@gmail.com




Re: lucene (search) performance tuning

2012-05-28 Thread Lance Norskog
And no, RAMDirectory does not help.

On Mon, May 28, 2012 at 5:54 PM, Lance Norskog goks...@gmail.com wrote:
 Can you use filter queries? Filters short-circuit a lot of search
 processing. City:San Francisco is a classic filter - it is a small
 part of the documents and it is reused a lot.

 On Sat, May 26, 2012 at 7:32 AM, Yang tedd...@gmail.com wrote:
 I'm using a disjunction (OR) query. Unfortunately all of the clauses are
 optional.

 On Sat, May 26, 2012 at 4:38 AM, Simon Willnauer 
 simon.willna...@googlemail.com wrote:

 On Sat, May 26, 2012 at 2:59 AM, Yang tedd...@gmail.com wrote:
  I tested with more threads / processes. indeed this is completely
  cpu-bound, since running 1 thread gives the same latency as 4 threads (my
  box has 4 cores)
 
 
  given this, is there any way to simplify the scoring computation (i'm
 only
  using lucene as a first level rough search, so the search quality is
 not
  a huge issue here) , so that, for example, fewer fields are evaluated or
 a
  simpler scoring function is used?

 are you using disjunction or conjunction queries? Can you make some
 parts of the query mandatory?

 simon
 
  thanks
  Yang
 
  On Fri, May 25, 2012 at 5:47 PM, Yang tedd...@gmail.com wrote:
 
  thanks a lot guys
 
 
  On Tue, May 22, 2012 at 1:34 AM, Ian Lea ian@gmail.com wrote:
 
  Lots of good tips in
  http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, linked from
  the FAQ.
 
 
  --
  Ian.
 
 
  On Tue, May 22, 2012 at 2:08 AM, Li Li fancye...@gmail.com wrote:
   something wrong when writing in my android client.
   if RAMDirectory do not help, i think the bottleneck is cpu. you may
 try
  to
   tune jvm but i do not expect much improvement.
   the best one is splitting your index into 2 or more smaller ones.
   you can then use solr s distributed searching.
   if the cpu is not fully used, yuo can do this in one physical machine
  
    On 2012-5-22 8:50 AM, Li Li fancye...@gmail.com wrote:
  
  
    On 2012-5-22 4:59 AM, Yang tedd...@gmail.com wrote:
  
   
I'm trying to make my search faster. right now a query like
   
name:Joe Moe Pizza   address:77 main street  city:San Francisco
   is this a conjunction query or a disjunction query?
  
in a index with 20mil such short business descriptions (total size
   about 3GB) takes about 100--200ms.
   20m is not a small size, how many results for a query in average?
  
I profiled the query, most time is spent in TermScorer.score(),
 as is
   shown by the attached yourkit screenshot.
   that's true, for a query, matching and scoring is very time
 consuming
   and cpu intensive. another one is io for reading postings.
  
   
   
   
I tried loading the index onto tmpfs (in-memory block device), and
  also
   tried RAMDirectory, neither helps much.
   if that is true. it seems that io is not the
I am reading
   http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf
it mentions
Size
– Stopword removal
– Stemming
• Lucene has a number of stemmers available
• Light versus Aggressive
• May prevent fine-grained matches in some cases
– Not a linear factor (usually) due to index compression
   
so for stopword removal, I'm already using the standard
 analyzer,
  so
   stop word removal is already included, right?
   
also generally any other tricks to try for reducing the search
  latency?
   
Thanks!
Yang
   
   
   
 
 
 






 --
 Lance Norskog
 goks...@gmail.com



-- 
Lance Norskog
goks...@gmail.com




Re: Sort runs out of memory

2012-05-23 Thread Lance Norskog
The Trie type can be tuned for range queries vs. single queries. This
seems to be explained in email and nowhere else:

http://www.lucidimagination.com/search/document/c501f59515a9eece

On Mon, May 21, 2012 at 12:54 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 On Thu, 2012-05-17 at 23:03 +0200, Robert Bart wrote:
 I am running Lucene 3.6 in a system that indexes about 4 billion documents
 across several indexes, and I'm hoping to get documents in order of a
 certain NumericField.

 What is the maximum size on any single index, in terms of number of
 documents? What is the type of the NumericField?

 I've tried using Lucene's Sort implementation, but it looks like it tries
 to do the entire sort in memory by allocating a huge array with space for
 every document in the index.

 The FieldCache allocates an array of length #documents with the same
 type as your NumericField. The sort itself is of the sliding-window
 type, meaning that it only takes up memory linear in the number of
 documents wanted in the response. Do you require millions of documents
 to be returned as part of a search?

 Sanity check: You do specify the type when performing a sorted search,
 right? If not, the values will be treated as Strings.

  On my index, this quickly runs out of memory.

 Assuming that your largest index is 1B documents and that your
 NumericField is of type Integer, the FieldCache's values for the sort
 should take up 1B * 4 = 4GB. Are you hoping for less?

 Are there any alternatives or better ways of getting documents in order of
 a NumericField for a very large index?

 Be sure to select the type of NumericField to be as small as possible.
 If you have few unique sort values (e.g. 17, 80, 2000 and 5678), you
 might map them down (to 0, 1, 2 and 3 for this example) and store them
 as a byte.

 Currently Lucene only supports atomic types for numerics in the
 FieldCache, so the smallest one is byte. It is possible to use only
 ceil(log2(#unique_values)) bits/document, although that requires a bit
 of custom coding.

 Regards,
 Toke Eskildsen






-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
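
As a footnote to the sanity check above, a hedged sketch of an explicitly
typed sort in Lucene 3.x, so the FieldCache loads 4-byte ints rather than
string values (the field name "price", the searcher and the query are
assumptions for illustration):

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

Sort sort = new Sort(new SortField("price", SortField.INT));
TopDocs hits = searcher.search(query, null, 20, sort);  // null filter, top 20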



Clear/Remove attribute from Token

2012-05-14 Thread Lance Norskog
I would like to remove a payload attribute from a token before it is
indexed. PayloadAttribute lets you set the payload to null.
AttributeSource (parent of all Tokens) does not have a 'remove
Attribute' method. You cannot capture the current attribute set with
'getState()' and then monkey with it (at least Eclipse does not show
me its methods).

If I set the payload to null, when the Token is saved in the index,
will a null payload be saved? Or does the payload get quietly dropped?

-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
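
For context, a hedged sketch of a TokenFilter that nulls out the payload
before indexing. As far as I can tell from the indexing code (the writeProx()
method mentioned in the follow-up below), a null payload is quietly dropped:
no payload bytes are written for that position.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public final class StripPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public StripPayloadFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(null);  // drop any payload set upstream
    return true;
  }
}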



Re: Clear/Remove attribute from Token

2012-05-14 Thread Lance Norskog
With more hunting, the code for this is in
org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(int,
int)

The next question is: does a Token need a PositionIncrementAttribute
to be written out? Or can I just tack on a Payload and that is it?
Does it need an Offset also?

On Mon, May 14, 2012 at 1:09 AM, Lance Norskog goks...@gmail.com wrote:
 I would like to remove a payload attribute from a token before it is
 indexed. PayloadAttribute lets you set the payload to null.
 AttributeSource (parent of all Tokens) does not have a 'remove
 Attribute' method. You cannot capture the current attribute set with
 'getState()' and then monkey with it (at least Eclipse does not show
 me its methods).

 If I set the payload to null, when the Token is saved in the index,
 will a null payload be saved? Or does the payload get quietly dropped?

 --
 Lance Norskog
 goks...@gmail.com



-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Here a merge thread, there a merge thread ...

2012-02-25 Thread Lance Norskog
Solr uses TieredMergePolicy by default now. You might find this
works more smoothly.

On Fri, Feb 24, 2012 at 10:03 AM, Benson Margulies
bimargul...@gmail.com wrote:
 On Fri, Feb 24, 2012 at 10:59 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 This is from ConcurrentMergeScheduler (the default MergeScheduler).

 But, are you sure the threads are sleeping, not exiting?  (They should
 be exiting).

 This merge scheduler starts a new thread when a merge is needed,
 allows that thread to do another merge (if one is immediately
 available), else the thread exits.

 They seem to exit eventually, but not quite as soon as they arrive.



 Mike McCandless

 http://blog.mikemccandless.com

 On Sun, Feb 19, 2012 at 9:05 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 A long-running program of mine (which Uwe's read a model of) slowly
 keeps adding merge threads. I count 22 at the moment. Each one shows
 up, runs for a bit, and then goes to sleep for, seemingly ever. I
 don't do anything explicit to control merging behavior.

 They name themselves Lucene Merge Thread #xxx where xxx is a
 non-contiguous but ever-growing number.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
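
For reference, a sketch of switching an IndexWriter to TieredMergePolicy
(available since Lucene 3.2); the version constant and the numbers are
illustrative, not recommendations:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_36,
    new StandardAnalyzer(Version.LUCENE_36));
TieredMergePolicy tmp = new TieredMergePolicy();
tmp.setSegmentsPerTier(10.0);        // segments allowed per tier before merging
tmp.setMaxMergedSegmentMB(1024.0);   // cap the size of merged segments
conf.setMergePolicy(tmp);
IndexWriter writer = new IndexWriter(dir, conf);  // "dir" assumed to exist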



Re: Retrieving large numbers of documents from several disks in parallel

2011-12-22 Thread Lance Norskog
Is each index optimized?

From my vague grasp of Lucene file formats, I think you want to sort
the documents by segment document id, which is the order of documents
on the disk. This lets you materialize documents in their order on the
disk.

Solr (and other apps) generally use a separate thread per task and
separate index reading classes (not sure which any more).

As to the cold-start, how many terms are there? You are loading them
into the field cache, right? Solr has a feature called auto-warming
which automatically runs common queries each time it reopens an index.

On Wed, Dec 21, 2011 at 11:11 PM, Paul Libbrecht p...@hoplahup.net wrote:
 Michael,

 from a physical point of view, it would seem like the order in which the
 documents are read is very significant for the reading speed (think of the
 random-access jumps as being the issue).

 You could:
 - move to ram-disk or ssd to make a difference?
 - use something different than a searcher which might be doing it better 
 (pure speculation: does a hit-collector make a difference?)

 hope it helps.

 paul


 On 22 Dec. 2011 at 03:45, Robert Bart wrote:

 Hi All,


 I am running Lucene 3.4 in an application that indexes about 1 billion
 factual assertions (Documents) from the web over four separate disks, so
 that each disk has a separate index of about 250 million documents. The
 Documents are relatively small, less than 1KB each. These indexes provide
 data to our web demo (http://openie.cs.washington.edu), where a typical
 search needs to retrieve and materialize as many as 3,000 Documents from
 each index in order to display a page of results to the user.


 In the worst case, a new, uncached query takes around 30 seconds to
 complete, with all four disks IO bottlenecked during most of this time. My
 implementation uses a separate Thread per disk to (1) call
 IndexSearcher.search(Query query, Filter filter, int n) and (2) process the
 Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like
 a long time to retrieve 3,000 small Documents, I am wondering if I am
 overlooking something simple somewhere.


 Is there a better method for retrieving documents in bulk?


 Is there a better way of parallelizing indexes from separate disks than to
 use a MultiReader (which doesn’t seem to parallelize the task of
 materializing Documents)


 Any other suggestions? I have tried some of the basic ideas on the Lucene
 wiki, such as leaving the IndexSearcher open for the life of the process (a
 servlet). Any help would be greatly appreciated!


 Rob


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
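
One concrete version of the "sort by segment document id" advice above, as a
hedged sketch (Lucene 3.x, one searcher per disk assumed, error handling
omitted): collect the hits first, sort the doc ids ascending, then
materialize, so the reads walk the index files mostly in order.

import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.TopDocs;

TopDocs top = searcher.search(query, filter, 3000);
int[] ids = new int[top.scoreDocs.length];
for (int i = 0; i < ids.length; i++) {
  ids[i] = top.scoreDocs[i].doc;
}
Arrays.sort(ids);                   // ascending docid ~ on-disk order
for (int id : ids) {
  Document doc = searcher.doc(id);  // reads become mostly sequential
  // ... materialize the fields you need
}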



Re: semanticvectors

2011-09-07 Thread Lance Norskog
It's kind of a bazooka, but the Mahout project has support for this.

https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation

On Tue, Aug 30, 2011 at 6:24 AM, zarrinkalam f_z...@yahoo.com wrote:

 Dear paul,

 did you use semanticvectors? I couldn't find  appropriate help

 zarrinkalam



 
 From: Paul Libbrecht p...@hoplahup.net
 To: java-user@lucene.apache.org
 Sent: Monday, August 29, 2011 7:28 PM
 Subject: Re: LSI

 Zarrinkalam,

 have a look at semanticvectors.

 paul


 On 29 August 2011 at 15:55, zarrinkalam wrote:

  hi,
 
  I want to use LSI for clustring ducuments indexed with lucene, I dont
 know how, plz help me
 
  thanks,


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-- 
Lance Norskog
goks...@gmail.com


Re: RAMDirectory doesn't win over FSDirectory all the time, why?

2011-06-16 Thread Lance Norskog
The RAMDirectory uses Java memory, an FSDirectory does not. Holding
Java memory makes garbage collection work harder. The operating system
is very very good at managing disk buffers, and does a better job
using spare memory than Java does.

For real-world sites, RAMDirectory is almost always useless. Maybe the
Instantiated index stuff is more what you want?

Lance

On Tue, Jun 7, 2011 at 2:52 AM, zhoucheng2008 zhoucheng2...@gmail.com wrote:
 Makes sense. Thanks

 -Original Message-
 From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
 Sent: Tuesday, June 07, 2011 4:28 PM
 To: java-user@lucene.apache.org
 Subject: Re: RAMDirectory doesn't win over FSDirectory all the time, why?

 On Mon, 2011-06-06 at 15:29 +0200, zhoucheng2008 wrote:
 I read the lucene in action book and just tested the
 FSversusRAMDirectoryTest.java with the following uncommented:
 [...]Here is the output:

 RAMDirectory Time: 805 ms

 FSDirectory Time : 728 ms

 This is the code, right?
 http://java.codefetch.com/example/in/LuceneInAction/src/lia/indexing/FSversusRAMDirectoryTest.java

 The test is problematic as the same two tests run sequentially.

 If you change
  long ramTiming = timeIndexWriter(ramDir);
  long fsTiming = timeIndexWriter(fsDir);
 to
  long fsTiming = timeIndexWriter(fsDir);
  long ramTiming = timeIndexWriter(ramDir);
 my guess is that RAMDirectory will be faster. For a better
 comparison, perform each test in separate runs (make a test
 class just for RAMDirectory and one just for FSDirectory,
 then run them one at a time, each in its own JVM).

 One big problem when comparing RAMDirectory to file-access
 is caching. What you measure with a test might not be what
 you see in production, as the production index might be
 large compared to RAM available for file caching.


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
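
To make the separate-runs point concrete, a hedged sketch that benchmarks one
directory per JVM invocation, so the OS cache warmed by the first run cannot
flatter the second (the index path is made up, and the indexing body is left
as in the book's timeIndexWriter()):

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirBench {
  public static void main(String[] args) throws Exception {
    Directory dir = "ram".equals(args[0])
        ? new RAMDirectory()
        : FSDirectory.open(new File("/tmp/bench-index"));
    long start = System.currentTimeMillis();
    // ... index the same documents here as timeIndexWriter() does
    System.out.println(args[0] + ": "
        + (System.currentTimeMillis() - start) + " ms");
    dir.close();
  }
}

Run "java DirBench ram" and "java DirBench fs" as separate processes, several
times each, and compare medians.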



Re: Solr 1.4.1: Weird query results

2011-04-19 Thread Lance Norskog
Look at the text definition stack. Does it have the same analyzer
and filter that you used to make the index, and in the same order?

The specific problem is that the text field includes a stemmer, and
your code probably did not. And so marine is stored as, maybe
'marin'.  To check this out, look at the 'schema browser' page off the
admin page. This will show you all of the indexed terms in each field.
Also look at the Analysis page: this lets you see how text is parsed
and changed in the analysis stack.

On Tue, Apr 19, 2011 at 2:56 PM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm, I don't see the problem either. It *sounds* like you don't really
 have the default search field defined the way you think you do. Did you 
 restart
 Solr after making that change?

 I'm assuming that when you say "not created by Solr" you mean that it's
 created
 by Lucene. What version of Lucene and Solr are you using if that's true?

 You can test this by appending debugQuery=on to your query or checking
 the debug enable checkbox in the full query interface from the admin page.
 That should show you exactly what is being searched. You might also want
 to look at the analysis page for your field and see how your query
 is tokenized.

 But, like I said, this looks like it should work. If you can post the results
 of adding debugQuery=on, your actual fieldType definition for text_ws,
 your field declaration for text, and the defaultSearchField from your schema
 that would help. I can't tell you how many times something that's eluded me
 for hours is obvious to someone else :)..

 Best
 Erick



 On Tue, Apr 19, 2011 at 3:59 PM, Erik Fäßler erik.faess...@uni-jena.de 
 wrote:
  Hallo there,

 my issue qualifies as newbie question I guess, but I'm really a bit
 confused. I have an index which has not been created by Solr. Perhaps that's
 already the point although I fail to see why this should be an issue with my
 problem.

 I use the admin interface to check which results particular queries bring
 in. My index documents have a field text which holds the document text.
 This text has only been white space tokenized. So in my schema, the type for
 this field is text_ws. My schema says
 defaultSearchFieldtext/defaultSearchField.

 When I now search for, say, marine (without quotes), I don't get any
 search results. But when I search "marine" (that is, embraced by double
 quotes) I get my document hits. Alternatively, I can prepend the field name:
 'text:marine' and will also get my results.

 Similar with this phrase query: "marine mussels", where "In marine mussels
 of the genus" is a text snippet of a document. The phrase "marine mussels"
 won't give any hits. Searching for 'text:"marine mussels"' will give me the
 exact document containing this text snippet.

 I'm sure this has quite a simple explanation but I'm unable to find it right
 now ;-) Perhaps you can help with that.

 Thanks a lot!

 Best regards,

    Erik

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RE: ParallelMultisearcher

2011-03-22 Thread Lance Norskog





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help!

2011-03-01 Thread Lance Norskog
Check out the Mahout project: mahout.apache.org - there is a
lucene-based text classifier project in there.

Lance

On Tue, Mar 1, 2011 at 9:25 PM, Sundus Hassan sundushas...@gmail.com wrote:
 I am doing MS-Thesis on content-based text categorization.
 For This purpose I intend to use LUCENE.I need some
 help/tutorial/guide regarding:

 1) How to build and deploy LUCENE?
 2) Some basic information regarding working of Lucene?
 3) How to use LUCENE in my project?

 Will be looking forward for response.
 Thanks in advance.

 --
 Regards,
 Sundus Hassan

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using Lucene to search live, being-edited documents

2011-01-21 Thread Lance Norskog


-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 3.0.3 Contrib Query Parser : Custom Field Name Builder

2011-01-08 Thread Lance Norskog
Bravo!

On Fri, Jan 7, 2011 at 10:39 PM, Adriano Crestani
adrianocrest...@gmail.com wrote:
 I created a JIRA to fix this problem:
 https://issues.apache.org/jira/browse/LUCENE-2855

 On Sat, Jan 8, 2011 at 1:32 AM, Adriano Crestani
 adrianocrest...@gmail.comwrote:

 Hi Christopher,

 Thanks for raising this problem, I always thought a little bit strange to
 use CharSequence as map key. Then a just did a little bit of research and
 found this on CharSequence javadoc:

 This interface does not refine the general contracts of the equals and hashCode
  methods. The result of comparing two objects that implement CharSequence is
 therefore, in general, undefined. Each object may be implemented by a
 different class, and there is no guarantee that each class will be capable
 of testing its instances for equality with those of the other. It is
 therefore inappropriate to use arbitrary CharSequence instances as
 elements in a set or as keys in a map.

 So I think every Set or Map that uses CharSequence on contrib queryparser
 should be forced to use String instead. I think there is no need to change
 any API, we just need to make sure that toString() is invoked on the
 CharSequence object before adding it to any Set or Map, this way we can fix
 this problem for next 3.x release. However, for 4.x, we should ideally
 change every API that receives or return MapCharSequence,... or
 SetCharSequence to use only String.



 On Fri, Jan 7, 2011 at 8:44 PM, Christopher St John 
 ckstj...@gmail.comwrote:

 I'm trying to:

  StandardQueryTreeBuilder b = …;
  b.setBuilder( myfield, fieldSpecificBuilder);

 In the debugger I see that the builder is registered in the
 QueryTreeBuilder's fieldNameBuilders map.

 When parsing, QueryTreeBuilder.getBuilder tries to look
 up the builder by using the FieldableNode's field but the
 debugger says the node's field is an UnescapedCharSequence,
 not a String, and the lookup fails.

 Registering the builder with an UnescapedCharSequence
 for the name instead of a String doesn't seem to help,
 presumably because UCS doesn't have an equals
 or hashCode method.

 Suggestions? I've worked around it by registering a class
 based builder, checking for the field name and either
 delegating to the original builder or doing my custom
 processing, but it's a little awkward.

 -cks

 --
 Christopher St. John
 http://artofsystems.blogspot.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org







-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
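
The pitfall in miniature, as a hedged sketch (QueryBuilder and the node
variable stand in for the real contrib queryparser types): two CharSequence
implementations holding the same characters need not be equal, so a HashMap
lookup can miss unless the key is normalized to a String first.

import java.util.HashMap;
import java.util.Map;

Map<String, QueryBuilder> builders = new HashMap<String, QueryBuilder>();
builders.put("myfield", fieldSpecificBuilder);

CharSequence field = node.getField();               // e.g. an UnescapedCharSequence
QueryBuilder miss = builders.get(field);            // undefined equals(): misses
QueryBuilder hit = builders.get(field.toString());  // String equality: hits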



Re: Using Lucene/Solr for Plagiarism detection

2010-12-30 Thread Lance Norskog
The MoreLikeThis feature may be exactly what you want. Try it out.

On Thu, Dec 30, 2010 at 8:28 AM, Amel Fraisse amel.frai...@gmail.com wrote:
 Hello,

 No I'm not using cosine similarity metrics.


 2010/12/30 Shashi Kant sk...@sloan.mit.edu

 Have you considered using document similarity metrics such as Cosine
 Similarity?


 On Thu, Dec 30, 2010 at 6:05 AM, Amel Fraisse amel.frai...@gmail.com
 wrote:
  Hello,
 
  I am using Lucene for plagiarism detection.
 
  The goal is that: when I have a new document, I will check on the solr
 index
  if there is a document that contain some common chunk.
 
  So to compute similarity between the query and a source document I would
 use
  this formula :
 
  Score (suspicious document, source document) = Number of common chunk
  between source document and suspicious document  / Number of total chunk
 in
  the suspicious document.
 
  So I have to change the scoring formula in the Similarity class.
 
  How can I change the scoring formula? ( by customizing only the
 Similarity
  class? or Scorer?)
 
  Do you have an Example of this use case?
 
  Thank for your help.
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 --
 Amel Fraisse




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
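
As a starting point for changing the formula, a hedged sketch of a Lucene 3.x
Similarity that flattens tf/idf/norms so a BooleanQuery of chunk terms scores
roughly as "matched chunks / total query chunks" via the coord() factor
(mapping this exactly onto your chunking scheme is left open):

import org.apache.lucene.search.DefaultSimilarity;

public class ChunkOverlapSimilarity extends DefaultSimilarity {
  @Override public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
  @Override public float idf(int docFreq, int numDocs) { return 1.0f; }
  @Override public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
  @Override public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
  @Override public float coord(int overlap, int maxOverlap) {
    return (float) overlap / maxOverlap;  // fraction of query chunks matched
  }
}

Install it with setSimilarity() on both the IndexWriter and the IndexSearcher
so that norms written at index time agree with query-time scoring.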



Re: Using Lucene to search live, being-edited documents

2010-12-29 Thread Lance Norskog
Check out the Instantiated contrib for Lucene. This is an alternative
in-memory data structure that does not need commits and is faster (and
larger) than the Lucene Directory system.

On Wed, Dec 29, 2010 at 9:15 AM,  adam.salt...@gmail.com wrote:
 What has this to do with Lucene? You're thinking its index would be faster 
 than your own search algorithm. Would it though? Do you really need an index 
 or a good pattern matcher? I can't see what the stream buffer being flushed 
 by the user has to do with it? Don't you have to control that behaviour?

 Sent using BlackBerry® from Orange

 -Original Message-
 From: software visualization softwarevisualizat...@gmail.com
 Date: Wed, 29 Dec 2010 11:55:17
 To: java-user@lucene.apache.org; adam.salt...@gmail.com
 Reply-To: softwarevisualizat...@gmail.com
 Subject: Re: Using Lucene to search live, being-edited documents

 I am writing a text editor and have to provide  a certain search
 functionality .

 The  use case is for single user. A single  document is potentially very
 large and numerous such documents may be open and unflushed at any given
 time. Think many files of an IDE, except the files are larger. The user is
 free to change, say, variables names across documents which may be separate
 files opened simultaneously in a variety of tabs (say)  and being edited
 with no guarantee that the user has flushed or saved any of it.





 On Wed, Dec 29, 2010 at 10:37 AM, adam.salt...@gmail.com wrote:

 This is interesting. What are we driving at here? A single user? That
 doesn't make sense to unless you want to flag certain things as they
 construct the document. Or else why don't they know what is in their own
 document? There must be other ways apart from Lucene. It seems to me you
 want each line parsed as soon as entered and matched against some criteria.
 I would look at plugins for Open Office first. Or any other text editor. But
 not sure you have given enough information.
 Sent using BlackBerry® from Orange

 -Original Message-
 From: Sean spaceh...@foxmail.com
 Date: Wed, 29 Dec 2010 15:35:17
 To: java-userjava-user@lucene.apache.org
 Reply-To: java-user@lucene.apache.org
 Subject: Re:Using Lucene to search live, being-edited documents

 Does it make any sense?
  Every time a search result is shown, the original document could have been
 changed,  no matter how fast the indexing speed is.
 If you can accept this inconsistency, you do not need to index so
 frequently at all.


 -- Original --
 From:  software visualizationsoftwarevisualizat...@gmail.com;
 Date:  Wed, Dec 29, 2010 06:06 AM
 To:  java-userjava-user@lucene.apache.org;

 Subject:  Using Lucene to search live, being-edited documents


 This has probably been asked before but I couldn't find it, so...

 Is it possible / advisable / practical to use Lucene as the  basis of a
 live
 document search capability? By live document I mean a largish document
 such as a word processor might be able to handle which is being edited
 currently. Examples would be Word documents of some size that are begin
 written, really huge Java files, etc.

 The user is sitting there typing away and of course everything is changing
 in real time. This seems to be orthogonal to the idea of a Lucene index
 which is costly to construct  and costly to update.

 TIA






-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: asking about index verification tools

2010-11-17 Thread Lance Norskog
The Lucene CheckIndex program does this. It is the class 
org.apache.lucene.index.CheckIndex, which has a main() method.


Samarendra Pratap wrote:

It is not guaranteed that every term will be indexed. There is a limit on the
maximum number of terms per field (as of Lucene 3.0, and maybe earlier too).
Check out this
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

On Tue, Nov 16, 2010 at 11:36 AM, Yakobjacob...@opensuse-id.org  wrote:

   

hello all,
I would like to ask about lucene index. I mean I created a simple
program that created lucene indexes and stored it in a folder. also I
had use a diagnostic tools name Luke to be able to lurk inside lucene
index and find out its content. and I know that lucene is a standard
framework when it come to building a search engine. but I just wanted
to be sure that lucene indexes every term that existed in a file.

I mean, is there a way for me, or some tool out there, to verify that
the content in Lucene indexes is dependable, and that not a single
term went missing there?

I know that this is subjective question but I just wanted to hear your
two cents.
thanks though. :-)

tl;dr: how can we know that the index in lucene is correct?

--
http://jacobian.web.id

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


 


   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
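
For completeness, CheckIndex can be run straight from the command line; the
jar version and index path here are illustrative:

java -cp lucene-core-3.0.2.jar org.apache.lucene.index.CheckIndex /path/to/index

Adding -fix at the end will drop any unrecoverable segments (losing the
documents in them), so take a backup first.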



Re: What is the best Analyzer and Parser for this type of question?

2010-11-15 Thread Lance Norskog
First, to understand what your query looks like, go to 
admin/analysis.jsp. It lets you see what happens to your queries when 
they go in. Then, do the query with debugQuery=true. This will add some 
complex junk to the end of the XML page that describes in painful detail 
exactly how each document was scored.


After all that- you might have a problem with the PrnP etc. stuff 
getting chopped up in weird ways. I don't know how people handle this in 
chemistry/bio search.


Lance

Ahmet Arslan wrote:
   

Example of Question:
- What is the role of PrnP in mad cow disease?
 

First thing is do not directly query questions. Manually formulate queries:
remove 'what' 'is' 'the' 'of' '?' etc.

For example i would convert this question into:

"mad cow"^5 "cow disease"^3 "mad cow disease"^15 "role PrnP"~5^2 "role mad cow
disease"~45 mad^0.1 role^0.5 "cow disease PrnP"^10

   

I am running on 11,638 documents and the result is 10,410
docs for this question (low precision)
 

Use OR default operator, collect and evaluate top 1000 documents only.

And instead of Porter you can try KStem.
http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi

Try different length normalization described here. Also their Lucene query 
example (SpanNear) can inspire you.  
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can I use Lucene for this?

2010-11-13 Thread Lance Norskog
The Lucene MoreLikeThis tool in lucene/contrib/similar will do one
variant of what you want.

You can do this particular test in Solr; you'll find it much
easier to put together.
For other text similarities, you'll have to code them directly.

Lance

On Sat, Nov 13, 2010 at 7:07 AM, Shashi Kant sk...@sloan.mit.edu wrote:
 There are multiple measures of similarity for documents: Cosine similarity
 is a frequently used one.


 On Sat, Nov 13, 2010 at 9:23 AM, Ciprian URSU ursu@gmail.com wrote:

 Hi Guys,

        I just find out about Lucene; after reading the main things on wiki
 it seems to be a great tool, but I still didn't find out how can I use it
 for my needs. What I want to do is a small tool which has some documents
 (mainly text) inside and then when I have a new document as input, to
 compare it with all those which are stored and to give me back as a
 percentage of similarity. I have read this part:
 http://wiki.apache.org/lucene-java/ScoresAsPercentages but it is not yet
 very clear to me how to use Lucene for that. Is it possible that some of
 you
 have a sample code for that?
        Thanks a lot, and I apologize for the fact that for many of you this
 looks like a stupid post :).

 Best Regards,
 Ciprian.





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
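
A hedged sketch of the MoreLikeThis route (contrib/queries, Lucene 3.x); the
field name and setup are assumptions, and note the raw scores are not
percentages -- see the ScoresAsPercentages page linked above before presenting
them that way:

import java.io.StringReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;

IndexReader reader = IndexReader.open(dir);       // "dir" assumed to exist
IndexSearcher searcher = new IndexSearcher(reader);
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "text" });
Query like = mlt.like(new StringReader(newDocumentText));
TopDocs top = searcher.search(like, 10);          // the 10 most similar docs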



Re: How to handle more than Integer.MAX_VALUE documents?

2010-11-02 Thread Lance Norskog
You would have to control your MergePolicy so it doesn't collapse
everything back to one segment.

On Tue, Nov 2, 2010 at 12:03 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 On Tue, Nov 2, 2010 at 1:58 AM, Lance Norskog goks...@gmail.com wrote:
 2 billion is a hard limit. Usually people split indexes into multiple
 indexes long before this, and use the parallel multi reader (I think) to
 read from all of the sub-indexes.

 On Mon, Nov 1, 2010 at 2:16 PM, Zhang, Lisheng
 lisheng.zh...@broadvision.com wrote:

 Hi,

 Now lucene uses integer as document id, so it means we cannot have more
 than 2^31-1 documents within one collection? Even if we use MultiSearcher
 the document id is still integer so it seems this is still a problem?

 This is really the limit of a segment. I think you can write your own
 collector and collect documents with higher (absolute) doc ids than
 INT_MAX. Yet, I think if you reach the limit of INT_MAX documents you
 should really rethink the way your search works and apply some
 sharding techniques. I really haven't been up to that many docs in a
 single index, but I think it should work to have multiple segments with
 INT_MAX documents in them, since we search on the segment level, provided
 your collector supports it.

 simon

 We have been using lucene for some time and our document count is growing
 rather rapidly, maybe this is a much-discussed issue already, but I did not
 find the lead, any pointer would be really appreciated.

 Thanks very much for helps, Lisheng



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





 --
 Lance Norskog
 goks...@gmail.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
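
For reference, a sketch of the multi-index view (Lucene 3.x); the paths are
made up, and note that the combined view is still addressed by int doc ids,
so the 2^31-1 ceiling applies to the view as a whole:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

IndexReader r1 = IndexReader.open(FSDirectory.open(new File("/indexes/shard1")));
IndexReader r2 = IndexReader.open(FSDirectory.open(new File("/indexes/shard2")));
IndexSearcher searcher = new IndexSearcher(new MultiReader(r1, r2));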



Re: How to handle more than Integer.MAX_VALUE documents?

2010-11-01 Thread Lance Norskog
2 billion is a hard limit. Usually people split indexes into multiple
indexes long before this, and use the parallel multi reader (I think) to
read from all of the sub-indexes.

On Mon, Nov 1, 2010 at 2:16 PM, Zhang, Lisheng
lisheng.zh...@broadvision.com wrote:

 Hi,

 Now lucene uses integer as document id, so it means we cannot have more
 than 2^31-1 documents within one collection? Even if we use MultiSearcher
 the document id is still integer so it seems this is still a problem?

 We have been using lucene for some time and our document count is growing
 rather rapidly, maybe this is a much-discussed issue already, but I did not
 find the lead, any pointer would be really appreciated.

 Thanks very much for helps, Lisheng



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Email Indexing

2010-10-28 Thread Lance Norskog

Tika has some mailbox file parsing that includes metadata parsing.
For POP/IMAP email servers I don't know any tools.

Hasan Diwan wrote:

On 27 October 2010 18:16, Troy Wicalt...@wical.com  wrote:
   

Depends on what your trying to index, I suppose. Maildir or mbox? For some time 
now, off and on, I have been working to index an ezmlm mailing list archive. In 
the end, I went with Swish-E and have made quite a bit of progress. I am short 
of my complete goal though. The issue is that the search results do not return 
results that contain the subject, and there is currently no excerpt or phrase 
highlighting. My problem is the flat text email files I am working with have no 
xml or anything to help the indexer create fields from. I've not yet figured 
out how to convert the emails to xml.
 

Neither Maildir nor mbox -- IMAP/POP doesn't care. Basically, I want to
build the index based on the contents of (my) gmail box. I can
retrieve the messages using IMAP, just need to figure out the
structure of the index.

Converting email to XML? Email me off-list and I'll provide you with
some help (as email-to-XML conversion has little to do with Lucene).
   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Text categorization / classification

2010-10-27 Thread Lance Norskog
There are tools for this in the Mahout project. These are oriented
toward large-scale work.

http://mahout.apache.org

There is a big learning curve and you have to learn Hadoop somewhat.

The book 'Collective Intelligence' includes a suite of Python tools
for small-scale experiments.

On Wed, Oct 27, 2010 at 1:12 PM, Maria Vazquez mvazq...@ova.st wrote:
 I need to auto-categorize a large number of documents. They are basically 
 news articles from major news sources (nytimes, npr, abcnews, etc).
 I'd like to categorize them automatically. Any suggestions?
 Lucene in Action suggests using a set of documents to build category vectors 
 and then comparing each document to each of those vectors and get the closest 
 one.
 The approach seems pretty simple (from other papers I read on text 
 categorization) but maybe you guys know of something out there that already 
 does this using Lucene/Solr.
 Thanks!
 Maria

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to export lucene index to a simple text file?

2010-09-21 Thread Lance Norskog
The Lucene CheckIndex program opens an index and walks all of the data 
structures. It is a good start for you.


Sahin Buyrukbilen wrote:

Thank you Uwe, I will read the docs and try to do it, however do you have an
example code? I need because I am not very familiar with Java.

Thank you.

Sahin

On Tue, Sep 21, 2010 at 12:29 PM, Uwe Schindleru...@thetaphi.de  wrote:

   

Hi,

Retrieve a TermEnum and iterate it. By that you get all terms and can
retrieve the docFreq, which is the second column in your table. Finally for
each term you position the TermDocs enum on this term to get all document
ids. Read docs of IndexReader/TermEnum/TermDocs about this.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 

-Original Message-
From: Sahin Buyrukbilen [mailto:sahin.buyrukbi...@gmail.com]
Sent: Tuesday, September 21, 2010 9:12 AM
To: java-user@lucene.apache.org
Subject: How to export lucene index to a simple text file?

Hi,

I am currently working on a project about private information retrieval
   

and I
 

need to have an inverted index file in txt format as follows:

Term t   freq(t)   Inverted list for t
--------------------------------------
and        1       <6, 0.159>
big        2       <2, 0.148> <3, 0.088>
dark       1       <6, 0.079>
.
.
.
.

here the <number1, number2> pairs are indicating: number1: doc ID, where
term t exists, with a rank of number2.

I have created an index from 5492 txt files, however the index is
composed of different files and most of the data is not in text format.

could somebody guide me to achieve this?

Thank you

Sahin.
   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


 
   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Checksum and transactional safety for lucene indexes

2010-09-20 Thread Lance Norskog
If an index file is not completely written to disk, it never becomes 
available. Lucene has a file describing the current active index 
segments. It writes all new files to the disk, and changes the 
description file (segments.gen) only after that.


If the index files are corrupted, all bets are off. Usually the data 
structures are damaged and Lucene throws CorruptIndexExceptions, NPE or 
array out-of-bounds exceptions. There is no checksumming of the index 
files.


Lance

Pulkit Singhal wrote:

Hello Everyone,

What happens if:
a) lucene index gets written half-way to the disk and then something goes wrong?
b) the index gets corrupted on the file system?

When we open that directory location again using FSDirectory implementations:
a) Is there any provision for the code to clean out the previous file
and start a new index file because the older one was corrupted and
didn't match the checksum?
b) Or can we check that the # of documents that can be found in the
underlying index are now ZERO because they can't be parsed properly?
How can we do this?

- Pulkit

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Connection question

2010-09-17 Thread Lance Norskog
This can probably be done. The hardest part is cross-correlating your
Lucene analyzer use with the Solr analyzer stack definition. There are
a few things Lucene does that Solr doesn't- span queries for one.

Lance

On Fri, Sep 17, 2010 at 12:39 PM, Christopher Gross cogr...@gmail.com wrote:
 Yes, I'm asking about network connections.

 Are you aware of any documentation on how I can set up Solr to use the
 Lucene index that I already have?

 Thanks!

 -- Chris



 On Fri, Sep 17, 2010 at 3:02 PM, Ian Lea ian@gmail.com wrote:
 Are you asking about network connections?  There is no networking
 built into lucene.  There is in solr, and lucene can use directories
 on networked file systems.


 --
 Ian.


 On Fri, Sep 17, 2010 at 6:08 PM, Christopher Gross cogr...@gmail.com wrote:
 I'm trying to connect to a Lucene index on a test server.  All of the
 examples that I've found use a local directory to connect into the
 Lucene index, but I can't find one that will remotely hook into it.

 Can someone please point me in the right direction?  I'm fairly
 certain that someone has run into and fixed this problem, but I
 haven't been able to find a way to do it.

 Thanks for your help!

 -- Chris

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extra Analyzers

2010-09-10 Thread Lance Norskog
Please start a new thread instead of hijacking this one.

2010/9/10 Iam Jabour iamjab...@gmail.com:
 Hi,

 I got lucene from http://www.apache.org/dyn/closer.cgi/lucene/java/
 but I'm looking for extra Analyzers like BrazilianAnalyzer [1] and
 others. Where can I get extra packages for lucene?

 Ty

 [1] - 
 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers/3.0.2/org/apache/lucene/analysis/br/BrazilianAnalyzer.java/
 __
 Iam Jabour




 On Thu, Sep 9, 2010 at 2:23 AM, fulin tang tangfu...@gmail.com wrote:
 we now have 0.15 billion documents, whose source size is 1.5 TB, on 16 shards.

 I am very interested how you get your job done


 The dream's beginning struggles at the edge of the city
 The heart's far horizon holds fast in the instant of each footstep
 My destiny has buried loneliness forever



 2010/8/26 Nigel nigelspl...@gmail.com:
 I'm curious about what the largest Lucene installations are, in terms of:

 - Greatest number of documents (i.e. X billion docs)
 - Largest data size (i.e. Y terabytes of indexes)
 - Most machines (i.e. Z shards or severs)

 Apart from general curiosity, the obvious follow-up question would be what
 approaches were taken to scale to extremes.

 We have ~11 billion documents indexed (growing at 2 billion per month), but
 I'm sure someone else has enough that this appears puny.  (-:

 Thanks,
 Chris


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sorting a Lucene index

2010-08-25 Thread Lance Norskog
It is also possible to sort by function. This allows you to avoid
storing an array of one int per document. It is slower than the raw
Lucene sort.

On Wed, Aug 25, 2010 at 1:46 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 On Wed, 2010-08-25 at 07:16 +0200, Shelly_Singh wrote:
 I have 1 bln documents to sort. So, that would mean 8 bln bytes (== 8 GB
 RAM).
 All I have is 8 GB on my machine, so I do not think this approach would work.

 This implies that your numeric value can be more than 2 billion. Are you
 sure that is true?


 First suggestion (simple): Ensure that your sort field is stored and
 sort by requesting the value for each document in the search result.
 This works okay when the number of hits is small.

 Second suggestion (complex): Make an int-array with the sort-order of
 your documents. This takes 4GB and needs to be calculated fully before
 use, which will take time. After that sorted searches will be very fast
 and handle a large number of hits well.

 You can let your indexer maintain the sort-array so that the existing
 order can be re-used when adding documents. Whether modifying an
 existing order-array is cheaper than a full re-sort or not depends on
 your batch size.

 Regards,
 Toke Eskildsen


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene applicability

2010-08-25 Thread Lance Norskog
A stepping stone to the above is that, in DB terms, a Lucene index is
only one table. It has a suite of indexing features that are very
different from database search. The features are oriented to searching
large bodies of text for ideas rather than concrete words. It
searches a lot faster than a DB. It also spends more time creating its
various indexes than a DB. Other points: you can't add or drop fields
or indexes.

On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
erickerick...@gmail.com wrote:
 The SOLR wiki has lots of good information, start there:
 http://wiki.apache.org/solr/

 Otherwise, see below...

 On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang 
 wolfgang.schrei...@itsv.at wrote:

 Hi all,

 We are currently evaluating potential search frameworks (such as Hibernate
 Search) which might be suitable to use in our project (using Spring, JPA
 with Hibernate) ...
 I am sending this E-Mail in hope you can advise me on a few issues that
 would help us in our decision making process.


 1.)    Is Lucene suitable for full text database searches? I read Lucene
 was designed to index and search documents but how does it behave querying
 relational data sets in general?


 Let's start by talking about the phrase "full text database searches". One
 thing virtually all db-centric
 people trip over is trying to use SOLR as if it were a database. You just
 can't think about tables. The
 first time you think about using SOLR to do something join-like, stop and
 take a deep breath and
 think about documents instead. The general approach is to flatten your data
 so that each document
 contains all the relevant info. Yes, this leads to de-normalization. Yes,
 denormalized data makes a
 good DBA cringe. But that's the difference between searching and using a
 RDBMS.

 Document is somewhat misleading. A document in SOLR terms is just a
 collection of fields. And, BTW,
 there's no requirement that each document have the same fields (very unlike
 a DB).



 2.)    Can we make assumptions on query performance considering combined
 searches, range queries or structured data and wildcard searches? If we
 consider a data structure consisting of say 3 tables and each table contains
 a few million entries (e.g. first name, last name and address fields) and we
 search for common values (such as 'John', 'Smith' and 'New York') where

 a.       each value for itself and each combination would result in
 millions of hits


 Sure, but what those assumptions are is totally dependent on how you've set
 things up. SOLR has been successfully
 used on several billion document indexes. There are tools for making all
 that work (i.e. replication, sharding, etc)
 built into SOLR. So I suspect you can make things work. Several million
 documents is not that large a data set.

 As always, there are tradeoffs between speed and complexity. But from what
 you've described
 I see no show stoppers.



 b.      a person can have multiple first names and we want to make sure to
 receive any combination of the last name with any first name


 This just sounds like an OR. But the queries can be pretty complex queries.
 Some examples of what you expect would help.
 See multi-valued fields. So, a document can have multiple firstname
 entries. Again, not like a DB (your reflexes will trip you
 up on this point <g>).


 c.       we search for a last name and a range of birth dates


 Sure, range queries work just fine. Note that dates can trip you up, look at
 TrieDate fields if you experiment.


 3.)    Transaction safety: How does Lucene handle indexes? If we update
 data model and index, what happens to the index if anything goes wrong as
 soon as the data model has been persisted?


 A lot of work has been done to make SOLR quite robust if anything goes
 wrong. That said, how are you backing up your data?
 That is, what is the source of the data you're going to index? If you're
 relying on your SOLR index to be your backup, you simply must back it up
 somewhere often enough to get by if your building burns down. I'd also
 think about storing your original input...

 This is no different than a DB. You have to guard against the disk crashing,
 someone walking by with a powerful magnet, earthquake, flood, fires
 <g>.

 Do note that if you modify your index schema, no existing documents reflect
 the new schema, you have to reindex them.



 I hope I made the issues clear to you, just some general thoughts about how
 Lucene would behave in a real world application scenario ... Any support or
 pointers to helpful documents or Web links are highly appreciated!
 Cheers for now,

 w






-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Solr SynonymFilter in Lucene analyzer

2010-08-18 Thread Lance Norskog
Yes, you need an analyzer that leaves successive words together as one
long term. This might be easier to do with the new CharFilter tool,
which processes text before it goes to the tokenizer.

What you are doing here is similar to Parts-Of-Speech analysis, where
text analysis software parses a sentence and labels words 'Noun',
'Verb', etc. One suite stores these labels as payloads on the terms.
This might be a better way to store your categories, rather than using
the synonym filter.

On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan
arunrangara...@gmail.com wrote:
 I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter
 is the one that prevents multi-word synonyms like New York from getting
 mapped to the generic synonym name like CONCEPTcity. It appears to me that
 an analyzer which recognizes that a white-space is inside a synonym like
 New York will be required. Do I need to implement one like this or is
 there already an analyzer I can use? Looks like I am missing something here,
 since Solr's SynonymFilter is supposed to handle this. Can someone tell me
 what is the correct way to integrate Solr's SynonymFilter within a custom
 lucene analyzer? Thanks.


 On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
 arunrangara...@gmail.comwrote:

 I am trying to have multi-word synonyms work in lucene using Solr's *
 SynonymFilter*.

 I need to match synonyms at index time, since many of the synonym lists are
 huge. Actually they are really not synonyms, but are words that belong to a
 concept. For example, I would like to map {New York, Los Angeles, New
 Orleans, Salt Lake City...}, a bunch of city names, to the concept called
 city. While searching, the user query for the concept city will be
 translated to a keyword like, say CONCEPTcity, which is the synonym for
 any city name.

 Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
 all I could match for CONCEPTcity is single word city names like
 Chicago, Seattle, Boston, etc., It would not match multi-word city
 names like New York, Los Angeles, etc.,

 I tried using Solr's SynonymFilter in tokenStream method in a custom
 Analyzer (that extends org.apache.lucene.analysis.
 Analyzer - lucene ver. 2.9.3) using:

 *    public TokenStream tokenStream(String fieldName, Reader reader) {
         TokenStream result = new SynonymFilter(
                 new WhitespaceTokenizer(reader),
                 synonymMap);
         return result;
     }
 *
 where *synonymMap* is loaded with synonyms using

 *synonymMap.add(conceptTerms, listOfTokens, true, true);*

 where *conceptTerms* is of type *ArrayListString* of all the terms in a
 concept and *listofTokens* is of type *ListToken  *and contains only the
 generic synonym identifier like *CONCEPTcity*.

 When I print synonymMap using synonymMap.toString(), I get the output like

 {New York={Chicago={Seattle={New
 Orleans=[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null}}}}

 so it looks like all the synonyms are loaded. But if I search for
 CONCEPTcity then it says no matches found. I am not sure whether I have
 loaded the synonyms correctly in the synonymMap.

 Any help will be deeply appreciated. Thanks!





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
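
Judging from the printed map, all the city names were chained into one long
phrase by a single add() call. A hedged sketch of loading one entry per
phrase instead -- assuming Solr 1.4's SynonymMap.add(match, replacement,
includeOrig, mergeExisting), where match is the token sequence of a single
phrase:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Token;

List<Token> replacement = new ArrayList<Token>();
replacement.add(new Token("CONCEPTcity", 0, 0));
for (String phrase : conceptTerms) {                // e.g. "New York"
  List<String> match = Arrays.asList(phrase.split("\\s+"));
  synonymMap.add(match, replacement, true, true);   // one entry per city name
}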



Re: Migrating from Lucene 2.9.1 to Solr 1.4.0 - Performance issues under heavy load

2010-08-03 Thread Lance Norskog
Is this an apples to apples comparison? That is, are you measuring
the same complete flow on both apps? Does the Lucene app return fields
via HTTP?

On Tue, Aug 3, 2010 at 11:28 AM, Ophir Adiv firt...@gmail.com wrote:
 Hi,



 I’m currently involved in a project of migrating from Lucene 2.9.1 to Solr
 1.4.0.

 During stress testing, I encountered this performance problem:

 While actual search times in our shards (which are now running Solr) have
 not changed, the total time it takes for a query has increased dramatically.

 During this performance test, we of course do not modify the indexes.

 Our application is sending Solr select queries concurrently to the 8 shards,
 using CommonsHttpSolrServer.

 I added some timing debug messages, and found that
 CommonsHttpSolrServer.java, line 416 takes about 95% of the application’s
 total search time:

 int statusCode = _httpClient.executeMethod(method);



 Just to clarify: looking at access logs of the Solr shards, TTLB for a query
 might be around 5 ms. (on all shards), but httpClient.executeMethod() for
 this query can be much higher – say, 50 ms.

 On average, if under light load queries take 12 ms., under heavy
 load they take around 22 ms.



 Another route we tried to pursue is add the “shards=shard1,shard2,…”
 parameter to the query instead of doing this ourselves, but this doesn’t
 seem to work due to an NPE caused by QueryComponent.returnFields(), line
 553:

 if (returnScores  sdoc.score != null) {



 where sdoc is null. I saw there is a null check on trunk, but since we’re
 currently using Solr 1.4.0’s ready-made WAR file, I didn’t see an easy way
 around this.

 Note: we’re using a custom query component which extends QueryComponent, but
 debugging this, I saw nothing wrong with the results at this point in the
 code.



 Our previous code used HTTP in a different manner:

 For each request, we created a new
 sun.net.www.protocol.http.HttpURLConnection, and called its getInputStream()
 method.

 Under the same load as the new application, the old application does not
 encounter the delays mentioned above.



 Our current code is initializing CommonsHttpSolrServer for each shard this
 way:



                                MultiThreadedHttpConnectionManager
 httpConnectionManager = new MultiThreadedHttpConnectionManager();


 httpConnectionManager.getParams().setTcpNoDelay(true);


 httpConnectionManager.getParams().setMaxTotalConnections(1024);


 httpConnectionManager.getParams().setStaleCheckingEnabled(false);

                                HttpClient httpClient = new HttpClient();

                                HttpClientParams params = new
 HttpClientParams();


 params.setCookiePolicy(CookiePolicy.IGNORE_COOKIES);

                                params.setAuthenticationPreemptive(false);


 params.setContentCharset(StringConstants.UTF8);

                                httpClient.setParams(params);


 httpClient.setHttpConnectionManager(httpConnectionManager);



 and passing the new HttpClient to the Solr Server:

 solrServer = new CommonsHttpSolrServer(coreUrl, httpClient);



 We tried two different ways – one with a single
 MultiThreadedHttpConnectionManager and HttpClient for all the SolrServer’s,
 and the other with a new MultiThreadedHttpConnectionManager and HttpClient
 for each SolrServer.

 Both tries yielded similar performance results.

 Also tried to give setMaxTotalConnections() a much higher connections number
 (1,000,000) – didn’t have an effect.



 Would love to hear what you think about this. TIA,

 Ophir




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Rank results only on some fields

2010-07-31 Thread Lance Norskog
Can't this use case be done with a function query?

On Sat, Jul 31, 2010 at 1:59 AM, Uwe Schindler u...@thetaphi.de wrote:
 Here some example code, the method is getFieldQuery() (Lucene 2.9 or 3.0 or
 following, don't use that approach before, because QueryWrapperFilter is not
 effective before 2.9 for that):

 @Override
 protected Query getFieldQuery(String field, String queryText)  throws
 ParseException {
        Query q = super.getFieldQuery(field,queryText);
        if (!TITLE.equals(field))
                q = new ConstantScoreQuery(new QueryWrapperFilter(q));
        return q;
 }

 I hope that explains itself. You may look at other Query type factories in
 QP that produce scoring queries and wrap them similar. But e.g. WildCard and
 RangeQueries are constant score. Phrases are also handled by this method.
 Only the slop setting may not work correctly after this (look at the
 instanceof checks in getFieldQuery(..., slop)).

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Saturday, July 31, 2010 10:19 AM
 To: java-user@lucene.apache.org
 Subject: RE: Rank results only on some fields

 You can construct the query using a customized query parser that wraps every
 query that is not on the suggested field in a new
 ConstantScoreQuery(new QueryWrapperFilter(originalCreatedQuery)).
 Override
 newFieldQuery() to do that and pass the super call into this ctor chain.

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: Philippe [mailto:mailer.tho...@gmail.com]
  Sent: Saturday, July 31, 2010 10:04 AM
  To: java-user@lucene.apache.org
  Subject: Rank results only on some fields
 
  Hi,
 
  I want to rank my results only on parts of my query. E.g. my query is
  TITLE:Lucene AND AUTHOR:Manning. After this query, standard Lucene
  ranking for both fields takes place.
 
  However, is it possible to query the index using the full query and
  rank results only according to the TITLE field?
 
  Regards,
       Philippe
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Rank results only on some fields

2010-07-31 Thread Lance Norskog
Oops, didn't notice that this was java-user. I had a Solr 'why write
more code' reaction :)

On Sat, Jul 31, 2010 at 1:56 PM, Uwe Schindler u...@thetaphi.de wrote:
 We don't want to modify the ranking using functions; we want to switch some 
 queries to constant score mode. The QueryParser subclassing is just to make 
 it convenient.
 
 In general, to strip scores off queries, you use new 
 ConstantScoreQuery(new QueryWrapperFilter(query)); this is used inside 
 Lucene, too (MultiTermQuery, ...). The trick is to normalize the Scorer to 
 return a constant value (the boost of the CSQ). This can be done by first 
 wrapping the original query in a filter and then adding a scorer on top of 
 the filter that returns a constant.

 With function queries you can do something similar by returning a constant in 
 the CustomScoreProvider. The QWF/CSQ trick is more convenient and used quite 
 often inside Lucene, too.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
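
For reference, a minimal sketch of the CustomScoreProvider variant mentioned
above (Lucene 2.9/3.x org.apache.lucene.search.function package; the class
name is illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreProvider;
import org.apache.lucene.search.function.CustomScoreQuery;

// Wraps any query so every match scores a constant, ignoring the real score.
public class ConstantScoringQuery extends CustomScoreQuery {
    public ConstantScoringQuery(Query subQuery) {
        super(subQuery);
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(IndexReader reader) {
        return new CustomScoreProvider(reader) {
            @Override
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
                return 1.0f; // constant, like the boost of a ConstantScoreQuery
            }
        };
    }
}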







-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best practices for searcher memory usage?

2010-07-14 Thread Lance Norskog
Glen, thank you for this very thorough and informative post.

Lance Norskog

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: segment_N file is missed

2010-06-19 Thread Lance Norskog
?

 That could be; Maryam is that what happened?

 Mike

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: phrase search in a particular case

2010-06-19 Thread Lance Norskog
SpanFirstQuery is the clean option. Another option is to add a start
token to each title; then search for the phrase "startToken oil spill". This
will be faster than SpanFirstQuery, but it also requires doing
something weird to the field.

Lance
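
For reference, a minimal sketch of the SpanFirstQuery option (the field name
"title" is an assumption):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Exact phrase "oil spill": slop 0, terms in order...
SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("title", "oil")),
        new SpanTermQuery(new Term("title", "spill"))
}, 0, true);
// ...whose span must end by position 2, i.e. the title starts with the phrase.
SpanQuery atStart = new SpanFirstQuery(phrase, 2);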

On Thu, Jun 17, 2010 at 3:19 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 SpanFirstQuery?

 Mike

 On Thu, Jun 17, 2010 at 3:23 PM, rakesh rakesh rakeshiit.2...@gmail.com 
 wrote:
 Hi,

  I have thousands of article titles in a Lucene index. So for a query "Oil
  spill" I want to return all the article titles that start with "Oil spill". I do
  not want titles which contain this phrase but do not start with it.

 Can anyone help me.

 Thanks in advance

 Thanks
 rakesh


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A question about google search index?

2010-06-13 Thread Lance Norskog
http://research.google.com/pubs/DistributedSystemsandParallelComputing.html

On Thu, Jun 10, 2010 at 1:51 AM, Yuval Feinstein yuv...@answers.com wrote:
 Most of the implementation of Google's search index is kept secret by Google.
 Based on publicly available information, the indexes are quite different -
 Google uses its BigTable and MapReduce technologies to efficiently distribute 
 the index.
 There are similar efforts in the Lucene ecosystem - Solr Cloud is an advanced 
 one, which is currently in development.
 As Google's scoring algorithm uses hundreds of signals, I guess they store 
 data pertinent to these signals in the index.
 Lucene's index holds relatively few pieces of information about every 
 document (posting lists, term vectors, sometimes norms and payloads).
 I believe there are other differences as well,
 but one could only guess what they are...
 Cheers,
 Yuval


 -Original Message-
 From: luocanrao [mailto:luocan19826...@sohu.com]
 Sent: Wednesday, June 09, 2010 5:18 PM
 To: java-user@lucene.apache.org
 Subject: A question about google search index?
 
 News about Google's search index: Lucene's index system can also support
 realtime search.
 
 Is there some difference between them?



 With Caffeine, we analyze the web in small portions and update our search
 index on a continuous basis, globally. As we find new pages, or new
 information on existing pages, we can add these straight to the index. That
 means you can find fresher information than ever before, no matter when or
 where it was published.



 Caffeine lets us index web pages on an enormous scale. In fact, every second
 Caffeine processes hundreds of thousands of pages in parallel. If this were
 a pile of paper it would grow three miles taller every second. Caffeine
 takes up nearly 100 million gigabytes of storage in one database and adds
 new information at a rate of hundreds of thousands of gigabytes per day. You
 would need 625,000 of the largest iPods to store that much information; if
 these were stacked end-to-end they would go for more than 40 miles


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: segment_N file is missed

2010-06-13 Thread Lance Norskog
The CheckIndex class/program will recreate the segment files when it
removes a segment from an index. That's the only source I've found for
how to make these files.

If you are able to hack this up, making a CFSDirectory would be a
wonderful addition to the Lucene Directory suite. A CFS file is a
complete Lucene index, and it is much, much easier to deploy single
files than file sets.
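
For what it's worth, a sketch of driving CheckIndex programmatically
(Lucene 3.x-era API; the index path is hypothetical). Note that fixIndex()
rewrites segments_N without the unreadable segments, so their documents are
lost:

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
CheckIndex checker = new CheckIndex(dir);
CheckIndex.Status status = checker.checkIndex();
if (!status.clean) {
    checker.fixIndex(status); // drops broken segments, writes a new segments_N
}
dir.close();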

On Wed, Jun 9, 2010 at 6:33 AM, maryam ma'danipour
m.madanip...@gmail.com wrote:
 Hello to all!
  I have the _0.cfs file of a Lucene index directory, but segments.gen and
 segments_2 are missing. Can I generate the segments.gen and segments_2 files
 without having to regenerate the _0.cfs file? Do these segments files
 contain any index-specific data, which would thus force me to regenerate the
 entire index? Or can I just generate the two segments files by
 copying them from another Lucene index directory generated with the same
 Lucene version, or can I merge this index with another index which has
 segments_N to retrieve the data?

 Thanks




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Solr tutorial

2010-06-01 Thread Lance Norskog
Use solr-user@ instead of java-user@. You'll find more knowledgeable people.

On Mon, May 31, 2010 at 6:36 PM, N Hira nh...@cognocys.com wrote:
 I don't know of a single tutorial that puts it all together, but the rich 
 documents feature implemented in Solr-284 would be where I would start:
 https://issues.apache.org/jira/browse/SOLR-284

 Look here if you're using Solr 1.4  -- it should address your needs:
 http://wiki.apache.org/solr/ExtractingRequestHandler


 Good luck,

 -h



 - Original Message 
 From: s...@icarinae.com s...@icarinae.com
 To: java-user@lucene.apache.org
 Sent: Mon, May 31, 2010 8:17:02 PM
 Subject: Solr tutorial

 Hi,

 I am kind of struggling to set up Solr to search PDF files. I am following
 documents from lucidimagination and the wiki. Can someone please point me to a
 good Solr tutorial with step-by-step instructions for indexing/searching PDF
 documents, highlighting and snippeting.

 Thanks in advance,

 Deepak

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Right memory for search application

2010-04-27 Thread Lance Norskog
Solr's timestamp representation (TrieDateField) is tuned for space and
speed. It has a compressed representation, and sorts with far less
space than Strings.

Also you get something called a date facet, which lets you bucketize
facet searches by time block.

On Tue, Apr 27, 2010 at 1:02 PM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Samarendra Pratap [samarz...@gmail.com] wrote:
 1. Our default option is sort by score; however, almost 8% of searches use
 sorting on a field (yyyymmddHHMMSS). This field is indexed as a string (not as
 NumericField or DateField).

 Guessing that the timestamp is practically unique for each document, sorting 
 by String takes up a bit more than
 18M * (40 bytes + 2 * "yyyymmddHHMMSS".length() bytes) ~= 1.2 GB of RAM, as 
 the Strings are cached. Coupled with the normal overhead of just opening an 
 index of your size (500MB by your measurements?), I would have guessed that 
 3600MB would definitely be enough to open the index and do sorted searches.
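
Spelled out as code (a back-of-the-envelope sketch under the same assumptions,
not a measurement):

// ~40 bytes of String/object overhead plus 2 bytes per char, per cached entry.
long numDocs = 18000000L;
int bytesPerEntry = 40 + 2 * "yyyymmddHHMMSS".length(); // 40 + 2*14 = 68
long totalBytes = numDocs * bytesPerEntry;              // 1,224,000,000 bytes ~= 1.2 GB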

 I realize that fiddling with production servers is dangerous, but connecting 
 with JConsole and forcing a garbage collection might be acceptable? That 
 should enable you to determine whether you're leaking memory or if it's just 
 the JVM being greedy. I'd guess you're leaking though, as HotSpot does not 
 normally allocate up to the limit if it does not need to.

 Anyway, changing to one of the optimized fields for sorting dates should 
 shave 1 GB off the memory requirement, so I'll recommend doing that no matter 
 what the main cause of your memory problems is.

 Regards,
 Toke Eskildsen

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Utility program to extract a segment

2010-04-14 Thread Lance Norskog
Is there a program available that makes a new index with one or more
segments from an existing index? (The immediate use case for this is
doing forensics on corrupted indexes.)

The user interface would be:
extract -segments _ab,_g9 oldindex newindex

This would copy the files for segments _ab and _g9 into a new
directory and generate a segments.gen for just those two segments. Is
this all that's needed?
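
A rough sketch of the copy half only (java.io; copyFile is a hypothetical
helper, and producing a valid segments_N/segments.gen for the new directory is
the hard part that is not shown):

import java.io.File;

// Copy every file belonging to the named segments: _ab.cfs, _ab.fnm, ...,
// plus per-segment deletes like _ab_1.del.
File src = new File("oldindex");
File dst = new File("newindex");
String[] wanted = { "_ab", "_g9" };
for (File f : src.listFiles()) {
    for (String seg : wanted) {
        if (f.getName().startsWith(seg + ".") || f.getName().startsWith(seg + "_")) {
            copyFile(f, new File(dst, f.getName())); // hypothetical helper
        }
    }
}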

-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter and memory usage

2010-04-12 Thread Lance Norskog




-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[JOB] Solr/Lucene developer wanted in startup: San Francisco Peninsula, CA, USA

2007-10-09 Thread Lance Norskog
Hi-
 
We are a startup in the web indexing space. We are looking for an
experienced search engine developer for our team. We are using Solr and
Lucene, but search is search and solid experience with any large search
engine is welcome.
 
At this point we do not wish to disclose our name. We have solid funding and
are a real business. We have a contract with a large company to lease our
index and provide various services.
 
Thanks for your time. Please contact me at [EMAIL PROTECTED]
 
Lance Norskog
650-922-8831


UTF-8/unicode input in querying in Lucene

2007-09-14 Thread Lance Norskog
Hi-
 
The page http://lucene.apache.org/java/docs/queryparsersyntax.html does not
mention that \u Unicode syntax is supported.
For example, \u0048\u0045\u004c\u004c\u004f is HELLO.
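
A sketch of exercising this from Java, assuming the behavior described above
(the field name "body" is made up; note the doubled backslashes, so the escape
reaches the query parser as text instead of being decoded by the Java
compiler):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser qp = new QueryParser("body", new StandardAnalyzer());
// Should parse to the same term query as parsing HELLO directly.
Query q = qp.parse("\\u0048\\u0045\\u004c\\u004c\\u004f");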
 
Please add this to the page; it took experimentation to discover it.
 
Thanks,
 
Lance Norskog