BTRFS ?

2014-12-21 Thread Otis Gospodnetic
Hi,

I spotted Uwe's comment in JIRA the other day about BTRFS, which might also
bring some cool things for Lucene...

Has anyone tried Lucene (or Solr or Elasticsearch) with BTRFS and seen some
(performance) benefits over ext3/4 or xfs for example?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


JOB @ Sematext: Professional Services Lead => Head

2014-02-18 Thread Otis Gospodnetic
Hello,


We have what I think is a great opening at Sematext. Ideal candidate would
be in New York, but that's not an absolute must. More info below + on
http://sematext.com/about/jobs.html in job-ad-speak, but I'd be happy to
describe what we are looking for, what we do, and what types of companies
we work with in regular-human-speak off-line.

DESCRIPTION

Sematext is hiring a technical, hands-on Professional Services Lead to join,
lead, and grow the Professional Services side of Sematext and potentially
grow into the Head role.

REQUIREMENTS

* Experience working with Solr or Elasticsearch

* Plan and coordinate customer engagements from business and technical
perspective

* Identify customer pain points, needs, and success criteria at the onset
of each engagement

* Provide expert-level consulting and support services and strive to be a
trustworthy advisor to a wide range of customers

* Resolve complex search issues involving Solr or Elasticsearch

* Identify opportunities to provide customers with additional value through
our products or services

* Communicate high-value use cases and customer feedback to our Product
teams

* Participate in open source community by contributing bug fixes,
improvements, answering questions, etc.

EXPERIENCE

* BS or higher in Engineering or Computer Science preferred

* 2 or more years of IT Consulting and/or Professional Services experience
required

* Exposure to other related open source projects (Hadoop, Nutch, Kafka,
Storm, Mahout, etc.) a plus

* Experience with other commercial and open source search technologies a
plus

* Enterprise Search, eCommerce, and/or Business Intelligence experience a
plus

* Experience working in a startup a plus

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: MergePolicy for append-only indices?

2014-01-28 Thread Otis Gospodnetic
Thanks Mike(s) & Co.
Added https://issues.apache.org/jira/browse/LUCENE-5419

Sounds like a killer feature :)

Otis



On Wed, Jan 8, 2014 at 4:17 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov
 msoko...@safaribooksonline.com wrote:
  I think the key optimization when there are no deletions is that you
 don't
  need to renumber documents and can bulk-copy blocks of contiguous
 documents,
  and that is independent of merge policy. I think :)

 Merging of term vectors and stored fields will always use bulk-copy
 for contiguous chunks of non-deleted docs, so for the append-only case
 these will be the max chunk size and be efficient.

 We have no codec that implements bulk merging for postings, which
 would be interesting to pursue: in the append-only case it's possible,
 and merging of postings is normally by far the most time consuming
 step of a merge.

 Also, no RAM will be used holding the doc mapping, since the docIDs
 don't change.

 These benefits are independent of the MergePolicy.

 I think TieredMergePolicy will work fine for append-only; I'm not sure
 how you'd improve on its approach.  It will in general renumber the
 docs, so if that's a problem, apps should use LogByteSizeMP.

 Mike McCandless

 http://blog.mikemccandless.com
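
(A minimal sketch of acting on this advice, assuming a Lucene 4.x-era API; the
index path and analyzer choice are placeholders, not from the thread:)

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46,
      new StandardAnalyzer(Version.LUCENE_46));
  // TieredMergePolicy (the default) may merge non-adjacent segments and thus
  // renumber docs; LogByteSizeMergePolicy merges adjacent segments only, so
  // docIDs keep their insertion order -- per Mike's note above.
  config.setMergePolicy(new LogByteSizeMergePolicy());
  IndexWriter writer = new IndexWriter(
      FSDirectory.open(new File("/path/to/index")), config);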



MergePolicy for append-only indices?

2014-01-06 Thread Otis Gospodnetic
Hi,
(cross-posting to both Solr and Lucene user lists because while this is a
Lucene-level question, I suspect a lot of people who know about this or are
interested in this subject are actually on the Solr list)

I have a large append-only index and I looked at merge policies hoping to
identify one that is naturally more suitable for indices without any
updates and deletions, just adds.

I've read
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/TieredMergePolicy.html and
the javadocs for its cousins, but it doesn't look like any of them is
better suited for an append-only index than the others, and TieredMP, having
more knobs, is probably the best one to use.

I was wondering if I was missing something, if one of the MPs is in fact
better for append-only indices OR if one can suggest how one could write a
custom MP that's specialized for append-only indices.

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Lucene for Log file indexing and search

2013-09-20 Thread Otis Gospodnetic
Hi,

Logstash is the piece that first touches your logs, filters them, and then 
outputs them somewhere.
People often use it with ElasticSearch.  Once logs are in ES, they look at them 
with Kibana.

Note: somebody should write a Logstash output for Solr!

In Solr world there is Flume, which has a Solr sink.
Flume has file tailing capability and Cloudera's Morphlines should allow one to 
process the log much like Logstash filters let you process them.

At Sematext we've built something called Logsene - http://sematext.com/logsene/ 
, which uses some of the above technologies or plays nice with them.


Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 





 From: Ivan Krišto ivan.kri...@gmail.com
To: java-user@lucene.apache.org 
Cc: gudiseashok gudise.as...@gmail.com 
Sent: Friday, September 20, 2013 1:59 AM
Subject: Re: Lucene for Log file indexing and search
 

On 09/19/2013 07:41 PM, gudiseashok wrote:
 I am learning lucene, and I am developing an application to do a search in log
 files in multi-environment boxes. I have googled for a deeper
 understanding, but all examples were just referring to the fields File
 Name & Modification (i.e. field types associated with text search) and they
 are returning results. 

Hello!

If you don't have some extremely specific needs, check out Logstash --
http://logstash.net/  http://www.elasticsearch.org/overview/logstash/
It is powered by ElasticSearch (product similar to Solr, also based on
Lucene).


  Regards,
    Ivan Krišto




Re: Content based recommender using lucene/solr

2013-06-28 Thread Otis Gospodnetic
Hi,

Have a look at http://www.youtube.com/watch?v=13yQbaW2V4Y .  I'd say
it's easier than Mahout, especially if you already have and know your
way around Solr.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Jun 28, 2013 at 2:02 PM, Luis Carlos Guerrero Covo
lcguerreroc...@gmail.com wrote:
 Hey saikat, thanks for your suggestion. I've looked into mahout and other
 alternatives for computing k nearest neighbors. I would have to run a job
 and compute the k nearest neighbors and track them in the index for
 retrieval. I wanted to see if this was something I could do with lucene
 using lucene's scoring function and solr's morelikethis component. The job
 you specifically mention is for Item based recommendation which would
 require me to track the different items users have viewed. I'm looking for
 a content based approach where I would use a distance measure to establish
 how near items are (how similar) and have some kind of training phase to
 adjust weights.


 On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:

 Why not just use mahout to do this, there is an item similarity algorithm
 in mahout that does exactly this :)


 https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

 You can use mahout in distributed and non-distributed mode as well.

  From: lcguerreroc...@gmail.com
  Date: Fri, 28 Jun 2013 12:16:57 -0500
  Subject: Content based recommender using lucene/solr
  To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
 
  Hi,
 
  I'm using lucene and solr right now in a production environment with an
  index of about a million docs. I'm working on a recommender that
 basically
  would list the n most similar items to the user based on the current item
  he is viewing.
 
  I've been thinking of using solr/lucene since I already have all docs
  available and I want a quick version that can be deployed while we work
 on
  a more robust recommender. How about overriding the default similarity so
  that it scores documents based on the euclidean distance of normalized
 item
  attributes and then using a morelikethis component to pass in the
  attributes of the item for which I want to generate recommendations? I
 know
  it has its issues like recomputing scores/normalization/weight
 application
  at query time which could make this idea unfeasible/impractical. I'm at a
  very preliminary stage right now with this and would love some
 suggestions
  from experienced users.
 
  thank you,
 
  Luis Guerrero





 --
 Luis Carlos Guerrero Covo
 M.S. Computer Engineering
 (57) 3183542047
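
(A hedged sketch of the MoreLikeThis route Luis describes, against the Lucene
3.x contrib-queries API; the field names and docId variable are made-up, and
the custom euclidean-distance Similarity he mentions is not shown:)

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.similar.MoreLikeThis;

  IndexReader reader = IndexReader.open(directory);   // the existing item index
  IndexSearcher searcher = new IndexSearcher(reader);
  MoreLikeThis mlt = new MoreLikeThis(reader);
  mlt.setFieldNames(new String[] { "title", "attributes" });  // made-up fields
  mlt.setMinTermFreq(1);
  mlt.setMinDocFreq(2);
  Query like = mlt.like(currentItemDocId);  // internal docID of the viewed item
  TopDocs similar = searcher.search(like, 10);  // 10 most similar items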




Re: Content based recommender using lucene/solr

2013-06-28 Thread Otis Gospodnetic
Hi,

It doesn't have to be one or the other.  In the past I've built a news
recommender engine based on CF (Mahout) and combined it with Content
Similarity-based engine (wasn't Solr/Lucene, but something custom that
worked with ngrams, but it may have as well been Lucene/Solr/ES).  It
worked well.  If you haven't worked with Mahout before I'd suggest the
approach in that video and going from there to Mahout only if it's
limiting.

See Ted's stuff on this topic, too:
http://www.slideshare.net/tdunning/search-as-recommendation +
http://berlinbuzzwords.de/sessions/multi-modal-recommendation-algorithms
(note: Mahout, Solr, Pig)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Jun 28, 2013 at 2:07 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 You could build a custom recommender in mahout to accomplish this, also just 
 out of curiosity why the content based approach as opposed to building a 
 recommender based on co-occurence.  One other thing, what is your data size, 
 are you looking at scale where you need something like hadoop?

 From: lcguerreroc...@gmail.com
 Date: Fri, 28 Jun 2013 13:02:00 -0500
 Subject: Re: Content based recommender using lucene/solr
 To: solr-u...@lucene.apache.org
 CC: java-user@lucene.apache.org

 Hey saikat, thanks for your suggestion. I've looked into mahout and other
 alternatives for computing k nearest neighbors. I would have to run a job
 and compute the k nearest neighbors and track them in the index for
 retrieval. I wanted to see if this was something I could do with lucene
 using lucene's scoring function and solr's morelikethis component. The job
 you specifically mention is for Item based recommendation which would
 require me to track the different items users have viewed. I'm looking for
 a content based approach where I would use a distance measure to establish
 how near items are (how similar) and have some kind of training phase to
 adjust weights.


 On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:

  Why not just use mahout to do this, there is an item similarity algorithm
  in mahout that does exactly this :)
 
 
  https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
 
  You can use mahout in distributed and non-distributed mode as well.
 
   From: lcguerreroc...@gmail.com
   Date: Fri, 28 Jun 2013 12:16:57 -0500
   Subject: Content based recommender using lucene/solr
   To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
  
   Hi,
  
   I'm using lucene and solr right now in a production environment with an
   index of about a million docs. I'm working on a recommender that
  basically
   would list the n most similar items to the user based on the current item
   he is viewing.
  
   I've been thinking of using solr/lucene since I already have all docs
   available and I want a quick version that can be deployed while we work
  on
   a more robust recommender. How about overriding the default similarity so
   that it scores documents based on the euclidean distance of normalized
  item
   attributes and then using a morelikethis component to pass in the
   attributes of the item for which I want to generate recommendations? I
  know
   it has its issues like recomputing scores/normalization/weight
  application
   at query time which could make this idea unfeasible/impractical. I'm at a
   very preliminary stage right now with this and would love some
  suggestions
   from experienced users.
  
   thank you,
  
   Luis Guerrero
 
 



 --
 Luis Carlos Guerrero Covo
 M.S. Computer Engineering
 (57) 3183542047





Document scoring order?

2013-04-03 Thread Otis Gospodnetic
Hi,

When Lucene scores matching documents, what is the order in which
documents are processed/scored and can that be changed?  I'm guessing
it scores matches in whichever order they are stored in the index/on
disk, which means by increasing docIDs?

I do see that some out-of-order scoring is possible, but can one visit
docs to score in, say, lexicographical order of a specific document
field?

Thanks,
Otis
--
Solr & ElasticSearch Support
http://sematext.com/
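
(For anyone curious, a hedged sketch using the Lucene 3.x-style Collector API --
in 4.x setNextReader takes an AtomicReaderContext instead -- that makes the
visiting order observable; by default hits arrive segment by segment, in
increasing docID order within each segment:)

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;

  public class OrderLoggingCollector extends Collector {
    private int docBase;

    @Override public void setScorer(Scorer scorer) {}

    @Override public void setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;  // this segment's offset into the global docID space
    }

    @Override public void collect(int doc) throws IOException {
      System.out.println("visited global docID " + (docBase + doc));
    }

    // Returning true here would permit BooleanScorer-style out-of-order hits:
    @Override public boolean acceptsDocsOutOfOrder() { return false; }
  }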




Re: Any benchmark corpus to evaluate performance of a specified query?

2013-01-17 Thread Otis Gospodnetic
Hi,

Maybe https://github.com/sematext/ActionGenerator could be of help?
We use it to produce query load for Solr and ElasticSearch and the whole thing 
is extensible, so you could easily add support for talking directly to Lucene.

Oh, and there is the benchmark in Lucene: 
http://lucene.apache.org/core/4_0_0/benchmark/index.html

Otis


Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




 From: lukai lukai1...@gmail.com
To: java-user@lucene.apache.org 
Sent: Wednesday, January 16, 2013 2:19 PM
Subject: Any benchmark corpus to evaluate performance of a specified query?
 
As the title says, do we have any benchmark corpus to test the performance of a new
query implementation? Like 10k docs, or 1M docs?

Thanks,




Poll: how to report # of docs in index over time

2012-02-13 Thread Otis Gospodnetic
Hello,

Quick poll for those who have an opinion about what index size monitoring 
should report in terms of the number of documents in the index.

Poll: http://blog.sematext.com/2012/02/13/poll-solr-index-size-monitoring/

For example, imagine that in some 5-minute time period (say 10:00 AM to 10:05 
AM) we check the index 5 times (in reality we do it much more frequently) and 
each time we do that we find the index has a different number of documents in 
it: 10, 15, 20, 25, and finally 30 documents.  Now imagine this data as a graph 
showing the number of indexed documents over time, but with the smallest time 
period shown being a 5-minute interval.

Given the above example, how many documents should this graph report for the 
10:00 – 10:05 AM period?
Should it show the minimum – 10?  Average – 20?  Median – 20?  Maximum – 30?  
Minimum, average, and maximum – 10, 20, 30?   Something else? 
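
(For concreteness, a toy sketch of the candidate aggregates over that one
bucket, using the sample counts above:)

  int[] samples = {10, 15, 20, 25, 30};  // doc counts sampled in one 5-minute bucket
  int min = samples[0], max = samples[0], sum = 0;
  for (int s : samples) {
    min = Math.min(min, s);
    max = Math.max(max, s);
    sum += s;
  }
  double avg = (double) sum / samples.length;
  // reports: min=10, avg=20.0, max=30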

Thanks!

Otis





Re: How can i search lucene java user list archive?

2011-10-20 Thread Otis Gospodnetic
Have a look at http://search-lucene.com/ where you can search Lucene mailing 
list archives (user, dev, common), its web site, wiki, source code, JIRA, etc., 
as well as the same types of data for Solr, Nutch, and so on.

Otis


Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: janwen tom.grade1...@163.com
To: java-user java-user@lucene.apache.org
Sent: Thursday, October 20, 2011 4:46 AM
Subject: How can i search lucene java user list archive?

I want to know how to search the java user list archive.
There is no search function on the 
site: http://mail-archives.apache.org/mod_mbox/lucene-java-user/
Any idea?
thanks

2011-10-20



janwen | China 
website : http://www.qianpin.com/



Hit search-lucene.com a little harder

2011-10-18 Thread Otis Gospodnetic
Hello folks,

Do you ever use http://search-lucene.com (SL) or http://search-hadoop.com (SH)?

If you do, I'd like to ask you for a small favour:
We are at Lucene Eurocon in Barcelona and we are about to show the Search 
Analytics [1] and Performance Monitoring [2] tools/services we've built and 
that we use on these two sites.
We would like to show the audience various pretty graphs and would love those 
graphs to be a little less sparse. :)

So if you use SL and/or SH, please feel free to use them a little extra now, 
if you feel like helping.

[1] http://sematext.com/search-analytics/index.html
[2] http://sematext.com/spm/solr-performance-monitoring/index.html

I think we'll open up both of the above services to the public tomorrow (and 
100% free for an undetermined length of time), but if you don't have time to sign 
up and set it up for yourself, yet are interested in reports, graphs, etc., let 
me know and we'll put together a blog post or something and include interesting 
things in it.

Thanks,
Otis


Re: OutOfMemoryError

2011-10-18 Thread Otis Gospodnetic
Hi Tamara,

You didn't say what -Xmx value you are using.  Try a little higher value.  Note 
that loading field values (and it looks like this one may be big because it is 
compressed) from a lot of hits is not recommended.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
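
(One common way out, as a hedged sketch against the Lucene 2.9/3.0-era API seen
in the stack trace below; "query" and "maxHits" are placeholders. The idea is to
skip the deprecated Hits class and load only the single stored field needed:)

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.FieldSelector;
  import org.apache.lucene.document.MapFieldSelector;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;

  // Load only the DOCUMENT field instead of all stored fields of each hit:
  FieldSelector onlyDocField = new MapFieldSelector(new String[] { "DOCUMENT" });
  TopDocs topDocs = searcher.search(query, maxHits);
  for (ScoreDoc sd : topDocs.scoreDocs) {
    Document doc = searcher.doc(sd.doc, onlyDocField);
    String docText = doc.getField("DOCUMENT").stringValue();
    // ... process docText, keeping no reference to it so it can be GC'd
  }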



From: Tamara Bobic tamara.bo...@scai.fraunhofer.de
To: java-user@lucene.apache.org
Cc: Roman Klinger roman.klin...@scai.fraunhofer.de
Sent: Tuesday, October 18, 2011 12:21 PM
Subject: OutOfMemoryError

Hi all,

I am using Lucene to query Medline abstracts and as a result I get around 3 
million hits. Each of the hits is processed and information from a certain 
field is used.

After a certain number of hits, somewhere around 1 million (not always the same 
number) I get OutOfMemory exception that looks like this:

Exception in thread "main" java.lang.OutOfMemoryError
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:221)
    at java.util.zip.Inflater.inflate(Inflater.java:238)
    at 
org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108)
    at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:609)
    at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:385)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:231)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:1013)
    at 
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:520)
    at 
org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:149)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:152)
    at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:156)
    at org.apache.lucene.search.Hits.doc(Hits.java:180)
    at 
de.fhg.scai.bio.tamara.corpusBuilding.LuceneCmdLineInterface.queryMedline(LuceneCmdLineInterface.java:178)
    at 
de.fhg.scai.bio.tamara.corpusBuilding.LuceneCmdLineInterface.main(LuceneCmdLineInterface.java:152)


this line which causes problems is:
String docText = hits.doc(j).getField("DOCUMENT").stringValue();

I am using java 1.6 and I tried solving this issue with different garbage 
collectors (-XX:+UseParallelGC and -XX:+UseParallelOldGC) but it didn't help.

Does anyone have any idea how to solve this problem?

There is also an official bug report:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6293787

Help is much appreciated. :)

Best regards,
Tamara Bobic






Castle for Lucene/Solr?

2011-09-03 Thread Otis Gospodnetic
Hello,

I saw mentions of something called Castle a while back, but only now looked at 
what it is, and it sounds like something that's potentially interesting/useful 
(performance-wise) for Lucene/Solr.

See http://twitter.com/#!/otisg/status/109768673467699200


Has anyone tried it with Lucene/Solr by any chance?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

Re: distributing the indexing process

2011-07-06 Thread Otis Gospodnetic
We've used Hadoop MapReduce with Solr to parallelize indexing for a customer, 
and that brought their multi-hour indexing process down to a couple of 
minutes.  There is/was also a Lucene-level contrib in Hadoop that makes use of 
MapReduce to parallelize indexing.

Otis


Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
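
(The merge step asked about in the quoted question can be as small as this
hedged sketch, using the Lucene 3.x-era API; targetDir, analyzer, and
workerDirs -- the per-machine indexes gathered onto one box -- are placeholders:)

  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;

  IndexWriter writer = new IndexWriter(targetDir, analyzer, true,
      IndexWriter.MaxFieldLength.UNLIMITED);
  // Merge the per-machine indexes into one. The result is logically
  // equivalent to single-machine indexing, though docIDs depend on order.
  writer.addIndexes(workerDirs);  // Directory[]; addIndexesNoOptimize() on 3.0.x
  writer.close();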


- Original Message -
 From: Guru Chandar guru.chan...@consona.com
 To: java-user@lucene.apache.org
 Cc: 
 Sent: Thursday, June 30, 2011 5:12 AM
 Subject: distributing the indexing process
 
 
 
 If we have to index a lot of documents, is there a way to divide the
 documents into multiple sets and index them on multiple machines in
 parallel, and then merge the resulting indexes back into a single
 machine? If yes, will the result be logically equivalent to indexing all
 the documents on a single machine?
 
 
 
 Thanks,
 
 -gc





Re: How do I sort lucene search results by relevance and time?

2011-05-11 Thread Otis Gospodnetic
If only you were using Solr 
http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
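
(At the raw Lucene level, one hedged way to approximate those boost functions
is a CustomScoreQuery; this sketch uses the Lucene 3.x
org.apache.lucene.search.function API and assumes a made-up "timestamp" field
holding epoch millis indexed as a NumericField:)

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.function.CustomScoreProvider;
  import org.apache.lucene.search.function.CustomScoreQuery;

  public class RecencyBoostQuery extends CustomScoreQuery {
    public RecencyBoostQuery(Query subQuery) { super(subQuery); }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(final IndexReader r)
        throws IOException {
      return new CustomScoreProvider(r) {
        final long[] millis = FieldCache.DEFAULT.getLongs(r, "timestamp");
        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
          double ageDays =
              (System.currentTimeMillis() - millis[doc]) / 86400000.0;
          // Newer docs get a gentle multiplicative bump; relevance dominates,
          // so results are not absolutely sorted by time desc.
          return (float) (subQueryScore * (1.0 + 1.0 / (1.0 + ageDays / 30.0)));
        }
      };
    }
  }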



- Original Message 
 From: Johnbin Wang johnbin.w...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Sun, May 8, 2011 11:59:11 PM
 Subject: How do I sort lucene search results by relevance and time?
 
 What I want to do is just like Google search results.  The results on the
 first page are the most relevant and also recent documents, but not
 absolutely sorted by time desc.
 
 -- 
 cheers,
 Johnbin  Wang
 




Re: AW: AW: AW: AW: fuzzy prefix search

2011-05-04 Thread Otis Gospodnetic
We do have EdgeNGramTokenizer if that is what you are after.
See how Solr uses it here:
http://search-lucene.com/c/Solr:/src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java||EdgeNGramTokenizer


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
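
(A tiny hedged illustration of what it produces, using the Lucene 3.x contrib
analyzers API and a made-up input:)

  import java.io.StringReader;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;

  // Front-edge grams of sizes 1..4 for the input "Merlot":
  TokenStream ts = new EdgeNGramTokenizer(new StringReader("Merlot"),
      EdgeNGramTokenizer.Side.FRONT, 1, 4);
  // emits: "M", "Me", "Mer", "Merl"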



- Original Message 
 From: Clemens Wyss clemens...@mysign.ch
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Wed, May 4, 2011 2:07:40 AM
 Subject: AW: AW: AW: AW: fuzzy prefix search
 
 I know this is just an example.
 But even the WhitespaceAnalyzer takes the words apart, which I don't want. I 
would like the phrases as they are (maximum 3 words, e.g. "Merlot del Ticino", 
...) to be n-gram-ed. I hence want to have the n-grams:
 Mer
 Merl
 Merlo
 Merlot
 Merlot 
 Merlot d
 ...
 
 Regards
 Clemens
  -Ursprüngliche  Nachricht-
  Von: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
   Gesendet: Dienstag, 3. Mai 2011 23:12
  An: java-user@lucene.apache.org
   Betreff: Re: AW: AW: AW: fuzzy prefix search
 
  Clemens - that's  just an example.  Stick another tokenizer in there, like
   WhitespaceTokenizer in there, for example.
 
  Otis
   
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem
   search :: http://search-lucene.com/
 
 
 
  - Original  Message 
   From: Clemens Wyss clemens...@mysign.ch
   To:  java-user@lucene.apache.org  java-user@lucene.apache.org
Sent: Tue, May 3, 2011 4:31:14 PM
   Subject: AW: AW: AW: fuzzy  prefix search
  
    But doesn't the KeywordTokenizer extract single words out of the
   stream? I would like to create n-grams on the stream (field content) as it
   is...
  
  -Ursprüngliche Nachricht-
Von: Otis  Gospodnetic [mailto:otis_gospodne...@yahoo.com]
  Gesendet: Dienstag, 3. Mai 2011 21:31
An: java-user@lucene.apache.org
  Betreff: Re: AW: AW: fuzzy prefix search

Clemens,
   
Something a  la:
   
   public TokenStream tokenStream (String fieldName, Reader r) {
     return new EdgeNGramTokenFilter(new KeywordTokenizer(r),
       EdgeNGramTokenFilter.Side.FRONT, 1, 4); }
   

Check out page 265 of Lucene in Action 2.

 Otis

 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene
 ecosystem search :: http://search-lucene.com/
   
   

- Original  Message 
  From: Clemens Wyss clemens...@mysign.ch
  To:  java-user@lucene.apache.org   java-user@lucene.apache.org
   Sent: Tue, May 3, 2011 12:57:39 PM
  Subject: AW: AW: fuzzy  prefix search

   How does a simple Analyzer look that just n-grams the docs/fields?

   class SimpleNGramAnalyzer extends Analyzer {
     @Override
     public TokenStream tokenStream ( String fieldName, Reader reader )
     {
       EdgeNGramTokenFilter... ???
     }
   }
 
   -Ursprüngliche Nachricht-
  Von:   Otis  Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Gesendet: Dienstag, 3. Mai 2011 13:36
   An: java-user@lucene.apache.org
 Betreff: Re: AW: fuzzy prefix search
   
  Hi,
  
  I  didn't  read this thread closely,  but just in case:
  * Is this  something   you can handle with synonyms?
  * If this is for   English and you are  trying to handle typos,
   there is a
  list
 of
   common English misspellings  out there that you  could use  for
  this
perhaps.
   * Have you  considered  n-gramming your tokens?   Not sure if
  this would
  help,
  didn't read  messages/examples closely  enough, but  you may want
  to
 look at
  this if  you haven't done  so  yet.
 
  Otis
   
   Sematext :: http://sematext.com/ :: Solr  -  Lucene - Nutch
Lucene  ecosystem
   search :: http://search-lucene.com/
 
   
 
  -  Original  Message  
   From: Clemens  Wyss clemens...@mysign.ch
 To:  java-user@lucene.apache.orgjava-
  u...@lucene.apache.org
  Sent: Tue, May 3, 2011 5:25:30 AM
 Subject: AW: fuzzy prefix  search
   
   PrefixQuery
   I'd like the combination of prefix and fuzzy ;-) because people could
   also type "menlo" or "märl" and in any of these cases I'd like to
   get a hit on "Merlot" (for suggesting "Merlot")
  
   -Ursprüngliche   Nachricht-
Von: Ian  Lea   [mailto:ian@gmail.com]
   Gesendet:  Dienstag, 3. Mai 2011 11:22An:
   java-user@lucene.apache.org
Betreff: Re: fuzzy prefix  search

   I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong.
   What would be the edit distance between "mer" and "merlot"? Would
   it be less than 1.5, which I reckon would be the value of
   length(term)*0.5 as detailed in the javadocs?  Seems

Re: AW: fuzzy prefix search

2011-05-03 Thread Otis Gospodnetic
Hi,

I didn't read this thread closely, but just in case:
* Is this something you can handle with synonyms?
* If this is for English and you are trying to handle typos, there is a list of 
common English misspellings out there that you could use for this perhaps.
* Have you considered n-gramming your tokens?  Not sure if this would help, 
didn't read messages/examples closely enough, but you may want to look at this 
if you haven't done so yet.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Clemens Wyss clemens...@mysign.ch
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Tue, May 3, 2011 5:25:30 AM
 Subject: AW: fuzzy prefix search
 
 PrefixQuery
 I'd like the combination of prefix and fuzzy ;-) because people could also 
type "menlo" or "märl" and in any of these cases I'd like to get a hit on 
"Merlot" (for suggesting "Merlot")
 
  -Ursprüngliche  Nachricht-
  Von: Ian Lea [mailto:ian@gmail.com]
  Gesendet:  Dienstag, 3. Mai 2011 11:22
  An: java-user@lucene.apache.org
   Betreff: Re: fuzzy prefix search
  
  I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong.
  What would be the edit distance between "mer" and "merlot"? Would it be
  less than 1.5, which I reckon would be the value of length(term)*0.5 as
  detailed in the javadocs?  Seems unlikely, but I don't really know anything
  about the Levenshtein (edit distance) algorithm as used by FuzzyQuery.
  Wouldn't a PrefixQuery be more appropriate here?
  
  
   --
  Ian.
  
  On Tue, May 3, 2011 at 10:10 AM, Clemens Wyss  clemens...@mysign.ch
   wrote:
   Unfortunately lowercasing doesn't help.
   Also,  doesn't the FuzzyQuery ignore casing?
  
-Ursprüngliche Nachricht-
   Von: Ian Lea [mailto:ian@gmail.com]
Gesendet: Dienstag, 3. Mai 2011 11:06
   An: java-user@lucene.apache.org
Betreff: Re: fuzzy prefix search
  
    "Mer" != "mer".  The latter will be what is indexed because
StandardAnalyzer calls LowerCaseFilter.
  
--
   Ian.
  
  
   On  Tue, May 3, 2011 at 9:56 AM, Clemens Wyss
  clemens...@mysign.ch
wrote:
    Sorry for coming back to my issue. Can anybody explain why my simple
    unit test below fails? Any hint/help appreciated.

    Directory directory = new RAMDirectory();
    IndexWriter indexWriter = new IndexWriter( directory,
        new StandardAnalyzer( Version.LUCENE_31 ),
        IndexWriter.MaxFieldLength.UNLIMITED );
    Document document = new Document();
    document.add( new Field( "test", "Merlot", Field.Store.YES, Field.Index.ANALYZED ) );
    indexWriter.addDocument( document );
    IndexReader indexReader = indexWriter.getReader();
    IndexSearcher searcher = new IndexSearcher( indexReader );
    Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f, 0, 10 );
    // or Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f );
    TopDocs result = searcher.search( q, 10 );
    Assert.assertEquals( 1, result.totalHits );
   
-  Clemens
   
-Ursprüngliche  Nachricht-
Von: Clemens Wyss [mailto:clemens...@mysign.ch]
 Gesendet: Montag, 2. Mai 2011 23:01
An: java-user@lucene.apache.org
 Betreff: AW: fuzzy prefix search

Is it the combination of FuzzyQuery and Term  which makes the
search to go for word  boundaries?
   
  -Ursprüngliche Nachricht-
 Von: Clemens  Wyss [mailto:clemens...@mysign.ch]
  Gesendet: Montag, 2. Mai 2011 14:13
  An: java-user@lucene.apache.org
  Betreff: AW: fuzzy prefix search
 
  I tried this too, but unfortunately I only get hits when the
  search term is at least as long as the word to be looked up.

  E.g.:
  ...
  Directory directory = new RAMDirectory();
  IndexWriter indexWriter = new IndexWriter( directory,
      IndexManager.getIndexingAnalyzer( LOCALE_DE ),
      IndexWriter.MaxFieldLength.UNLIMITED );

  Document document = new Document();
  document.add( new Field( "test", "Merlot", Field.Store.YES, Field.Index.ANALYZED ) );
  indexWriter.addDocument( document );

  IndexReader indexReader = indexWriter.getReader();
  IndexSearcher searcher = new IndexSearcher( indexReader );

  Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.6f, 1 );
  TopDocs result = searcher.search( q, 10 );
  Assert.assertEquals( 1, result.totalHits );
  ...
 
  -Ursprüngliche  Nachricht-
  Von: Uwe Schindler [mailto:u...@thetaphi.de]
   Gesendet: Montag, 2. Mai 2011 13:50
   An: java-user@lucene.apache.org
   Betreff: RE: fuzzy prefix search
  
  Hi,
  
   You can pass an integer to FuzzyQuery which defines the number
   of characters that are seen as the prefix. So all terms must match
   this prefix and the rest of each term is matched using fuzzy.
 
   Uwe
 
   -
  Uwe 

Re: AW: AW: fuzzy prefix search

2011-05-03 Thread Otis Gospodnetic
Clemens,

Something a la:

public TokenStream tokenStream (String fieldName, Reader r) {
  return new EdgeNGramTokenFilter(new KeywordTokenizer(r), 
    EdgeNGramTokenFilter.Side.FRONT, 1, 4);
}


Check out page 265 of Lucene in Action 2.
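
(Fleshed out, a complete version of that sketch might look like the following,
assuming Lucene 3.1; the LowerCaseFilter is an addition so lowercase queries
like "menlo"/"märl" can still match, and the gram sizes 3..20 are arbitrary:)

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.KeywordTokenizer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
  import org.apache.lucene.util.Version;

  public class SimpleNGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      // Keep the whole field value ("Merlot del Ticino") as one token ...
      TokenStream stream = new KeywordTokenizer(reader);
      stream = new LowerCaseFilter(Version.LUCENE_31, stream);
      // ... and front-edge n-gram it so "mer" matches as a prefix.
      return new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 3, 20);
    }
  }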

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Clemens Wyss clemens...@mysign.ch
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Tue, May 3, 2011 12:57:39 PM
 Subject: AW: AW: fuzzy prefix search
 
 How does a simple Analyzer look that just n-grams the docs/fields?
 
 class SimpleNGramAnalyzer extends Analyzer
 {
   @Override
   public TokenStream tokenStream ( String fieldName, Reader reader )
   {
     EdgeNGramTokenFilter... ???
   }
 }
 
  -Ursprüngliche Nachricht-
  Von: Otis  Gospodnetic [mailto:otis_gospodne...@yahoo.com]
   Gesendet: Dienstag, 3. Mai 2011 13:36
  An: java-user@lucene.apache.org
   Betreff: Re: AW: fuzzy prefix search
  
  Hi,
  
  I  didn't read this thread closely, but just in case:
  * Is this something  you can handle with synonyms?
  * If this is for English and you are  trying to handle typos, there is a 
  list 
of
  common English misspellings  out there that you could use for this perhaps.
  * Have you considered  n-gramming your tokens?  Not sure if this would help,
  didn't read  messages/examples closely enough, but you may want to look at
  this if  you haven't done so yet.
  
  Otis
  
  Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch Lucene ecosystem
  search :: http://search-lucene.com/
  
  
  
  - Original  Message 
   From: Clemens Wyss clemens...@mysign.ch
   To:  java-user@lucene.apache.org  java-user@lucene.apache.org
Sent: Tue, May 3, 2011 5:25:30 AM
   Subject: AW: fuzzy prefix  search
  
    PrefixQuery
    I'd like the combination of prefix and fuzzy ;-) because people could
   also type "menlo" or "märl" and in any of these cases I'd like to get
    a hit on "Merlot" (for suggesting "Merlot")
  
 -Ursprüngliche  Nachricht-
Von: Ian Lea  [mailto:ian@gmail.com]
 Gesendet:  Dienstag, 3. Mai 2011 11:22
An: java-user@lucene.apache.org
  Betreff: Re: fuzzy prefix search
   
  I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong.
  What would be the edit distance between "mer" and "merlot"? Would
  it be less than 1.5, which I reckon would be the value of
  length(term)*0.5 as detailed in the javadocs?  Seems unlikely, but
  I don't really know anything about the Levenshtein (edit distance)
  algorithm as used by FuzzyQuery.
  Wouldn't a PrefixQuery be more appropriate here?
   
   
  --
Ian.
   
On Tue, May 3,  2011 at 10:10 AM, Clemens Wyss
clemens...@mysign.ch
  wrote:
 Unfortunately lowercasing doesn't  help.
 Also,  doesn't the FuzzyQuery ignore  casing?

   -Ursprüngliche Nachricht-
 Von: Ian Lea  [mailto:ian@gmail.com]
   Gesendet: Dienstag, 3. Mai 2011 11:06
  An: java-user@lucene.apache.org
   Betreff: Re: fuzzy prefix search
 
  Mer != mer.  The latter will be  what is indexed because
 StandardAnalyzer calls  LowerCaseFilter.

   --
 Ian.

 
 On  Tue, May 3, 2011 at 9:56 AM,  Clemens Wyss
clemens...@mysign.ch
   wrote:
   Sorry for coming back to my issue. Can anybody explain why my simple
   unit test below fails? Any hint/help appreciated.

   Directory directory = new RAMDirectory();
   IndexWriter indexWriter = new IndexWriter( directory,
       new StandardAnalyzer( Version.LUCENE_31 ),
       IndexWriter.MaxFieldLength.UNLIMITED );
   Document document = new Document();
   document.add( new Field( "test", "Merlot", Field.Store.YES, Field.Index.ANALYZED ) );
   indexWriter.addDocument( document );
   IndexReader indexReader = indexWriter.getReader();
   IndexSearcher searcher = new IndexSearcher( indexReader );
   Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f, 0, 10 );
   // or Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f );
   TopDocs result = searcher.search( q, 10 );
   Assert.assertEquals( 1, result.totalHits );
 
  -   Clemens
 
   -Ursprüngliche  Nachricht-
  Von:  Clemens Wyss [mailto:clemens...@mysign.ch]
Gesendet: Montag, 2. Mai 2011 23:01
   An: java-user@lucene.apache.org
Betreff: AW: fuzzy prefix search
   
  Is it the  combination of FuzzyQuery and Term  which makes the
   search to go for word  boundaries?
  
 -Ursprüngliche Nachricht-
   Von:  Clemens  Wyss [mailto:clemens...@mysign.ch]
 Gesendet: Montag, 2. Mai 2011 14:13
 An: java-user@lucene.apache.org
 Betreff: AW: fuzzy prefix  search
   
I tried this too, but unfortunately  I only get hits  when

Re: AW: AW: AW: fuzzy prefix search

2011-05-03 Thread Otis Gospodnetic
Clemens - that's just an example.  Stick another tokenizer in there, like 
WhitespaceTokenizer in there, for example.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Clemens Wyss clemens...@mysign.ch
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Tue, May 3, 2011 4:31:14 PM
 Subject: AW: AW: AW: fuzzy prefix search
 
 But doesn't the KeyWordTokenizer extract single words out oft he stream? I 
would  like to create n-grams on the stream (field content) as it is...
 
   -Ursprüngliche Nachricht-
  Von: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
   Gesendet: Dienstag, 3. Mai 2011 21:31
  An: java-user@lucene.apache.org
   Betreff: Re: AW: AW: fuzzy prefix search
  
  Clemens,
  
  Something a la:
  
   public TokenStream tokenStream (String fieldName, Reader r) {
     return new EdgeNGramTokenFilter(new KeywordTokenizer(r),
       EdgeNGramTokenFilter.Side.FRONT, 1, 4); }
  
  
  Check out page 265 of Lucene in Action 2.
  
   Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene  ecosystem search :: http://search-lucene.com/
  
  
  
  - Original  Message 
   From: Clemens Wyss clemens...@mysign.ch
   To:  java-user@lucene.apache.org  java-user@lucene.apache.org
Sent: Tue, May 3, 2011 12:57:39 PM
   Subject: AW: AW: fuzzy  prefix search
  
    How does a simple Analyzer look that just n-grams the docs/fields?
   
    class SimpleNGramAnalyzer extends Analyzer
    {
      @Override
      public TokenStream tokenStream ( String fieldName, Reader reader )
      {
        EdgeNGramTokenFilter... ???
      }
    }
   
-Ursprüngliche Nachricht-
Von:  Otis  Gospodnetic [mailto:otis_gospodne...@yahoo.com]
  Gesendet: Dienstag, 3. Mai 2011 13:36
An: java-user@lucene.apache.org
  Betreff: Re: AW: fuzzy prefix search

Hi,
   
I  didn't  read this thread closely, but just in case:
* Is this  something  you can handle with synonyms?
* If this is for  English and you are  trying to handle typos, there is 
a 
list
   of
common English misspellings  out there that you  could use for this
  perhaps.
* Have you  considered  n-gramming your tokens?  Not sure if this would
   help,
didn't read  messages/examples closely enough, but  you may want to
  look at
this if  you haven't done  so yet.
   
Otis

 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch Lucene
   ecosystem
search :: http://search-lucene.com/
   

   
- Original  Message  
 From: Clemens Wyss clemens...@mysign.ch
  To:  java-user@lucene.apache.org   java-user@lucene.apache.org
   Sent: Tue, May 3, 2011 5:25:30 AM
  Subject: AW: fuzzy prefix  search

  PrefixQuery
 I'd like the  combination  of prefix and fuzzy ;-) because  people 
could
 also  type menlo or märl and in any of these cases I'd like  to  
get
 a hit on Merlot (for suggesting  Merlot)

-Ursprüngliche  Nachricht-
  Von: Ian  Lea  [mailto:ian@gmail.com]
Gesendet:  Dienstag, 3. Mai 2011 11:22
   An: java-user@lucene.apache.org
 Betreff: Re: fuzzy prefix search
  
   I'd assumed that  FuzzyQuery  wouldn't ignore case but I could be
  wrong.
What would be the edit  distance between  mer  and merlot?
  Would
  it be less that 1.5  which I   reckon would be the value of
   length(term)*0.5 as detailed in  the  javadocs?  Seems unlikely,  
but
  I don't really  know anything about   the Levenshtein (edit 
distance)
algorithm as  used by  FuzzyQuery.
   Wouldn't a PrefixQuery be  more  appropriate here?
 
  
--
   Ian.
 
  On Tue, May  3,  2011 at 10:10 AM, Clemens Wyss
  clemens...@mysign.ch
 wrote:
   Unfortunately  lowercasing doesn't  help.
   Also,   doesn't the FuzzyQuery ignore  casing?
   
 -Ursprüngliche  Nachricht-
   Von: Ian Lea   [mailto:ian@gmail.com]
  Gesendet: Dienstag, 3. Mai 2011 11:06
 An: java-user@lucene.apache.org
  Betreff: Re: fuzzy prefix  search
   
 Mer != mer.  The latter will be  what is indexed  because
   StandardAnalyzer calls   LowerCaseFilter.
  
  --
   Ian.
   
   
On  Tue, May 3, 2011 at 9:56 AM,  Clemens  Wyss
  clemens...@mysign.ch
  wrote:
 Sorry for coming back  to my issue. Can anybody  explain why  
my
  simple
 unit test below fails? Any  hint/help  appreciated.

  Directory  directory = new RAMDirectory(); IndexWriter
  indexWriter =  new IndexWriter(  directory, new
  StandardAnalyzer(
Version.LUCENE_31
 ),   IndexWriter.MaxFieldLength.UNLIMITED  ); Document
  document
  =
  new
 Document();   document.add( new Field( test

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-02 Thread Otis Gospodnetic
Hi,

I think this describes what's going on:

10 load N stored queries
20 parse N stored queries, keep them in some List forever
30 for each incoming document create a new MemoryIndex instance mi
40 for query 1 to N do mi.search(query)

Over time this step 40 takes longer and longer and longer -- if some of the 
queries are MultiPhraseQueries.  This is even with mergeSort being used in 
MultiPhraseQuery.
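
(In code, the loop is roughly this hedged sketch, using the Lucene 3.1 contrib
MemoryIndex; the field name, analyzer, and the query/document sources are
placeholders:)

  import java.util.List;
  import org.apache.lucene.index.memory.MemoryIndex;
  import org.apache.lucene.search.Query;

  List<Query> queries = loadAndParseStoredQueries();  // steps 10-20: parse once, reuse
  for (String text : incomingDocuments()) {           // step 30
    MemoryIndex mi = new MemoryIndex();
    mi.addField("content", text, analyzer);
    for (Query q : queries) {                         // step 40: the part that slows down
      if (mi.search(q) > 0.0f) {
        // query q matches this one-document index
      }
    }
  }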

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: java-user@lucene.apache.org
 Sent: Mon, May 2, 2011 12:15:40 PM
 Subject: Re: MultiPhraseQuery slowing down over time in Lucene 3.1
 
 By slowing down over time do you mean you use the same index (no new
 docs  added) yet running the same MPQ over and over you see it taking
 longer to  execute over time?
 
 Mike
 
 http://blog.mikemccandless.com
 
 On Mon, May 2, 2011 at  12:00 PM, Tomislav Poljak tpol...@gmail.com wrote:
   Hi,
  after running tests on both MemoryIndex and RAMDirectory based  index
  in Lucene 3.1, seems MultiPhraseQueries are slowing down over  time
  (each iteration of executing the same MultiPhraseQueries on the  same
  doc, seems to require more and more execution time). Are there  any
  existing/known issues related to the MultiPhraseQuery in Lucene  3.1
  which could lead to this performance drop?
 
   Tomislav
 



Thoughts on Search Analytics?

2011-05-01 Thread Otis Gospodnetic
Hi,

I'd like to solicit your thoughts about Search Analytics if you are doing any 
sort of analysis/reporting of search logs or click stream or anything related.

* Which information or reports do you find the most useful and why?
* Which reports would you like to have, but don't have for whatever reason 
(don't have the needed data, or it's too hard to produce such reports, or ...)
* Which tool(s) or service(s) do you use and find the most useful?

I'm preparing a presentation on the topic of Search Analytics, so I'm trying to 
solicit opinions, practices, desires, etc. on this topic.

Your thoughts would be greatly appreciated.  If you could reply directly, that 
would be great, since this may be a bit OT for the list.

Thanks!
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/





Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
Hi,

OK, so it looks like it's not MemoryIndex and its Comparator that are funky.  
After switching from quickSort call in MemoryIndex to mergeSort, the problem 
persists:

'1205215856@qtp-684754483-7' Id=18, RUNNABLE on lock=, total cpu 
time=497060.ms user time=495210.ms
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:105) 
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) 
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104) 
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)

So something else is calling quickSort when it gets stuck.  Weirdly, when I get 
a thread dump and get the above, I don't see the original caller.  Maybe 
because 
the stack is already too deep and the printout is limited to N lines per call 
stack?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Uwe Schindler u...@thetaphi.de
 To: java-user@lucene.apache.org
 Sent: Thu, April 28, 2011 5:54:44 PM
 Subject: RE: SorterTemplate.quickSort causes StackOverflowError
 
  Thanks for confirming, Javier! :)
  
  Uwe, I assume you are referring to this line 528 in MemoryIndex?
  
   528 if (size > 1) ArrayUtil.quickSort(entries, termComparator);
  
  And this funky Comparator from MemoryIndex:
  
  208   private static final Comparator<Object> termComparator = new
  Comparator<Object>() {
  209     @SuppressWarnings("unchecked")
  210     public int compare(Object o1, Object o2) {
  211       if (o1 instanceof Map.Entry<?,?>) o1 = ((Map.Entry<?,?>)
  o1).getKey();
  212       if (o2 instanceof Map.Entry<?,?>) o2 = ((Map.Entry<?,?>)
  o2).getKey();
  213       if (o1 == o2) return 0;
  214       return ((Comparable) o1).compareTo((Comparable) o2);
  215     }
  216   };
  
   Will try, thanks!
 
 Yeah, simply try with mergeSort in line 528. If that  helps, this comparator
 is buggy.
 
 Uwe
 
 
  - Original  Message 
   From: Uwe Schindler u...@thetaphi.de
   To: java-user@lucene.apache.org
Sent: Thu, April 28, 2011 5:36:13 PM
   Subject: RE:  SorterTemplate.quickSort causes StackOverflowError
  
   Hi  Otis,
  
   Can you reproduce this somehow and send test  code? I could look  into
   it. I don't expect the error in the  quicksort algorithm itself as this
   one is used e.g. BytesRefHash /  TermsHash, if there is a bug we would
   have  seen it long time  ago.
  
   I have not seen this before, but I suspect  a  problem in this very
   strange comparator in MemoryIndex  (which is very broken,  if you look
   at its code - it can  compare Strings with Map.Entry and so on,
   b), maybe the  comparator is not stable? In this case, quicksort
   can  easily  loop endless and stack overflow. In Lucene 3.0 this used
   stock  java  sort (which is mergesort), maybe replace the
ArrayUtils.quickSort my  ArrayUtils.mergeSort() and see if problem  is
 still
  there?
  
   Uwe
  
-
   Uwe Schindler
   H.-H.-Meier-Allee 63,  D-28213  Bremen
   http://www.thetaphi.de
   eMail: u...@thetaphi.de
  
   
-Original  Message-
From:  Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
  Sent: Thursday, April 28, 2011 11:17 PM
To: java-user@lucene.apache.org
  Subject: SorterTemplate.quickSort causes  StackOverflowError
   
 Hi,

I'm looking at some code that uses MemoryIndex (Lucene  3.1)  and
that's exhibiting a strange behaviour - it  slows down over  time.
The MemoryIndex contains 1 doc, of  course, and executes a set of a
few thousand queries against  it.  The set of queries does not
change - the
same
set of queries gets executed on all incoming   documents.
This code runs very quickly. in the  beginning.   But  with time is
 gets
slower and  slower and slower. and then I get  this:
   
 4/28/11 10:32:52 PM (S) SolrException.log  :
 java.lang.StackOverflowError
at

   org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
  at
   
   org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
  at
   
 org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:
 104)
   
I haven't profiled this code  yet (remote server, firewall in
between,
   can't  use
YourKit...), but does the above look familiar to   anyone?
I've looked at the code and obviously there is the  recursive  call
that's problematic here - it looks like  the recursion just gets
deeper and deeper
and
gets stuck, eventually getting too deep for  the  JVM's taste.
   
Thanks,
 Otis

 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch Lucene
ecosystem  search :: http://search-lucene.com

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
Hi,

Yeah, that's what we were going to do, but instead we did:
* changed MemoryIndex to use ArrayUtil.mergeSort
* ran the app and did a thread dump that showed SorterTemplate.quickSort in 
deep recursion again!
* looked for other places where this call is made - found it in 
MultiPhraseQuery$MultiPhraseWeight and changed that call from 
ArrayUtil.quickSort to ArrayUtil.mergeSort
* now we no longer see SorterTemplate.quickSort in deep recursion when we do a 
thread dump
* we now occasionally catch SorterTemplate.mergeSort in our thread dumps, but 
only a few levels deep, which looks healthy

I don't think we'll be able to reproduce this easily - this happens with 
MemoryIndex and a few thousand stored queries that are confidential customer 
data :(

I'll be back if after a while mergeSort starts behaving the same as quickSort.

Thanks!
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Dawid Weiss dawid.we...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Fri, April 29, 2011 7:51:39 AM
 Subject: Re: SorterTemplate.quickSort causes StackOverflowError
 
 Don't know if this helps, but debugging stuff like this I simply add a
 (manually inserted or aspectj-injected) recursion count, add a breakpoint
 inside an if checking for recursion count > X, and run the VM with an
 attached socket debugger. This lets you run at (nearly) full speed and once
 you hit the breakpoint, inspect the stack, variables, etc...
 
 Dawid
 
 On Fri, Apr 29, 2011 at 1:40 PM, Otis Gospodnetic  
 otis_gospodne...@yahoo.com  wrote:
 
  Hi,
 
  OK, so it looks like it's not MemoryIndex  and its Comparator that are
  funky.
  After switching from  quickSort call in MemoryIndex to mergeSort, the
  problem
   persists:
 
   '1205215856@qtp-684754483-7' Id=18, RUNNABLE on lock=, total cpu
   time=497060.ms user time=495210.ms
   at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:105)
 
   at  
org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
   at  
org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
   at  
org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
   So something else is calling quickSort when it gets stuck.  Weirdly, when  
I
  get
  a thread dump and get the above, I don't see the original  caller.  Maybe
  because
  the stack is already too deep and  the printout is limited to N lines per
  call
   stack?
 
  Otis
  
  Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 
 
 
  - Original  Message 
   From: Uwe Schindler u...@thetaphi.de
   To: java-user@lucene.apache.org
Sent: Thu, April 28, 2011 5:54:44 PM
   Subject: RE:  SorterTemplate.quickSort causes StackOverflowError
  
 Thanks for confirming, Javier! :)
   
Uwe, I assume you are referring to this line 528 in MemoryIndex?

 528 if (size > 1) ArrayUtil.quickSort(entries, termComparator);

And this funky Comparator from MemoryIndex:

208   private static final Comparator<Object> termComparator = new
Comparator<Object>() {
209     @SuppressWarnings("unchecked")
210     public int compare(Object o1, Object o2) {
211       if (o1 instanceof Map.Entry<?,?>) o1 = ((Map.Entry<?,?>)
o1).getKey();
212       if (o2 instanceof Map.Entry<?,?>) o2 = ((Map.Entry<?,?>)
o2).getKey();
213       if (o1 == o2) return 0;
214       return ((Comparable) o1).compareTo((Comparable) o2);
215     }
216   };

 Will try, thanks!
  
   Yeah,  simply try with mergeSort in line 528. If that  helps, this
   comparator
   is buggy.
  
   Uwe
   
  
- Original  Message 
  From: Uwe Schindler u...@thetaphi.de
 To: java-user@lucene.apache.org
   Sent: Thu, April 28, 2011 5:36:13 PM
  Subject: RE:  SorterTemplate.quickSort causes  StackOverflowError

 Hi   Otis,

 Can you reproduce this  somehow and send test  code? I could look
   into
  it. I don't expect the error in the  quicksort algorithm itself  as
  this
 one is used e.g. BytesRefHash /   TermsHash, if there is a bug we
  would
 have   seen it long time  ago.

 I  have not seen this before, but I suspect  a  problem in this  very
 strange comparator in MemoryIndex  (which is  very broken,  if you
  look
 at its code - it  can  compare Strings with Map.Entry and so on,
  b), maybe the  comparator is not stable? In this case,  quicksort
 can  easily  loop endless and stack  overflow. In Lucene 3.0 this used
 stock  java   sort (which is mergesort), maybe replace the
   ArrayUtils.quickSort my  ArrayUtils.mergeSort() and see if  problem
   is
   still
there?
 
 Uwe

   -
 Uwe Schindler
  H.-H.-Meier

Reusing Query instances

2011-04-29 Thread Otis Gospodnetic
Hi,

Is there any reason why one would *not* want to reuse Query instances?

I'm using MemoryIndex with a fixed set of queries and I'm executing them all on 
each new document that comes in.  Because each document needs to have many tens 
of thousands of queries executed against it, I thought I'd just run all queries 
through QueryParser once at the beginning, and then just reuse Query instances 
on each incoming document.  What I've noticed is that my fixed set of queries 
takes longer and longer to execute as time passes (more and more time is spent 
inside memoryIndex.search() somewhere).  The problem is not heap/memory - 
there is no crazy GCing and the heap is not full, but the CPU is 100% busy.

I should note that queries I'm dealing with are ugly and big, using lots of 
wildcards, both trailing and prefix ones (and this is Lucene 3.1, so no faster 
wildcard impl).
I should also emphasize that at this point I only *suspect* that maaaybe the 
gradual slowdown I'm seeing has something to do with the fact that I'm reusing 
Query instances.

Is there any reason why one should not reuse Query instances?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




SorterTemplate.quickSort causes StackOverflowError

2011-04-28 Thread Otis Gospodnetic
Hi,

I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's 
exhibiting a strange behaviour - it slows down over time.
The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand 
queries against it.  The set of queries does not change - the same set of 
queries gets executed on all incoming documents.
This code runs very quickly... in the beginning.  But with time it gets 
slower and slower and slower... and then I get this:

4/28/11 10:32:52 PM (S) SolrException.log : java.lang.StackOverflowError
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)

I haven't profiled this code yet (remote server, firewall in between, can't use 
YourKit...), but does the above look familiar to anyone?
I've looked at the code and obviously there is the recursive call that's 
problematic here - it looks like the recursion just gets deeper and deeper and 
gets stuck, eventually getting too deep for the JVM's taste.

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/





Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-28 Thread Otis Gospodnetic
Thanks for confirming, Javier! :)

Uwe, I assume you are referring to this line 528 in MemoryIndex?

528 if (size > 1) ArrayUtil.quickSort(entries, termComparator);

And this funky Comparator from MemoryIndex:

208   private static final Comparator<Object> termComparator =
          new Comparator<Object>() {
209     @SuppressWarnings("unchecked")
210     public int compare(Object o1, Object o2) {
211       if (o1 instanceof Map.Entry<?,?>) o1 = ((Map.Entry<?,?>) o1).getKey();
212       if (o2 instanceof Map.Entry<?,?>) o2 = ((Map.Entry<?,?>) o2).getKey();
213       if (o1 == o2) return 0;
214       return ((Comparable) o1).compareTo((Comparable) o2);
215     }
216   };
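
A minimal sketch of the swap Uwe suggests for line 528 (ArrayUtil.mergeSort
takes the same arguments; mergesort is stable and cannot recurse
pathologically on an inconsistent comparator):

    // before:
    if (size > 1) ArrayUtil.quickSort(entries, termComparator);
    // after:
    if (size > 1) ArrayUtil.mergeSort(entries, termComparator);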

Will try, thanks!

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Uwe Schindler u...@thetaphi.de
 To: java-user@lucene.apache.org
 Sent: Thu, April 28, 2011 5:36:13 PM
 Subject: RE: SorterTemplate.quickSort causes StackOverflowError
 
 Hi Otis,
 
 Can you reproduce this somehow and send test code? I could look into it. I
 don't expect the error in the quicksort algorithm itself as this one is used
 e.g. by BytesRefHash / TermsHash; if there were a bug we would have seen it
 long ago.
 
 I have not seen this before, but I suspect a problem in this very strange
 comparator in MemoryIndex (which is very broken, if you look at its code -
 it can compare Strings with Map.Entry and so on). Maybe the
 comparator is not stable? In that case, quicksort can easily loop endlessly
 and overflow the stack. In Lucene 3.0 this used the stock Java sort (which is
 mergesort); maybe replace the ArrayUtil.quickSort with ArrayUtil.mergeSort()
 and see if the problem is still there?
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213  Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
  Sent: Thursday, April 28, 2011 11:17 PM
  To: java-user@lucene.apache.org
  Subject: SorterTemplate.quickSort causes StackOverflowError
 
  Hi,
 
  I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's
  exhibiting a strange behaviour - it slows down over time.
  The MemoryIndex contains 1 doc, of course, and executes a set of a few
  thousand queries against it.  The set of queries does not change - the
  same set of queries gets executed on all incoming documents.
  This code runs very quickly... in the beginning.  But with time it gets
  slower and slower and slower... and then I get this:
 
  4/28/11 10:32:52 PM (S) SolrException.log : java.lang.StackOverflowError
  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
 
  I haven't profiled this code yet (remote server, firewall in between, can't
  use YourKit...), but does the above look familiar to anyone?
  I've looked at the code and obviously there is the recursive call that's
  problematic here - it looks like the recursion just gets deeper and deeper
  and gets stuck, eventually getting too deep for the JVM's taste.
 
  Thanks,
  Otis
 
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/



Re: NRT consistency

2011-04-11 Thread Otis Gospodnetic
I think what's being described here is a lot like what I *think* ElasticSearch 
does, where there is no single master and index changes made to any node get 
propagated to N-1 other nodes (N=number of index replicas).  I'm not sure how 
it deals with situations where incompatible index changes are made to the same 
index via 2 different nodes at the same time.  Is that what vector clocks are 
about?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Mark Miller markrmil...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Mon, April 11, 2011 11:52:05 AM
 Subject: Re: NRT consistency
 
 
 On Apr 10, 2011, at 4:34 AM, Em wrote:
 
  Hello list,
 
  I am currently trying to understand Lucene's Near-Real-Time feature which
  was covered in Lucene in Action, Second Edition.
 
  Let's say I got a distributed system with a master and a slave.
 
  In Solr, replication is solved by checking for any differences in the
  index directory and consuming those differences to keep indices consistent.
 
  How is this possible within an NRT system? Is there any possibility to
  consume snapshots of the internal buffer of the index writer to send them
  to the slave?
 
 I think for near real time, Solr index replication may not be appropriate.
 Though I think it would be cool to use Andrzej's mythical single-pass index
 splitter to create a single+ doc segment that could be shipped around.
 
 Most likely, a system that just sends each doc to each replica is probably
 going to work a lot better. Introduces other issues of course - some of which
 we hope to alleviate with further SolrCloud work.
 
  Regards,
  Em
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/NRT-consistency-tp2801878p2801878.html
  Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 - Mark Miller
 lucidimagination.com
 
 Lucene/Solr User Conference
 May 25-26, San Francisco
 www.lucenerevolution.org



Re: Indexing Non-Textual Data

2011-04-06 Thread Otis Gospodnetic
Hi Chris,

Yes, people have done classification with Lucene before.  Have a look at 
http://search-lucene.com/?q=classifier&fc_project=Lucene for some discussions 
and actual code (in old JIRA issues).

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
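
For a rough idea of one approach from those discussions - treating each
integer feature as a token and classifying by the label of the best-scoring
match - here is a hedged sketch; field names and data are made up:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class FeatureClassifierSketch {
      static Document asDoc(int[] features, String label) {
        StringBuilder sb = new StringBuilder();
        for (int f : features) sb.append(f).append(' ');  // "12 87 3 ..." as text
        Document doc = new Document();
        doc.add(new Field("features", sb.toString(), Field.Store.NO,
                          Field.Index.ANALYZED));
        doc.add(new Field("label", label, Field.Store.YES,
                          Field.Index.NOT_ANALYZED));
        return doc;
      }

      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        w.addDocument(asDoc(new int[] {12, 87, 3}, "cat"));  // toy training data
        w.addDocument(asDoc(new int[] {11, 85, 3}, "cat"));
        w.addDocument(asDoc(new int[] {99, 2, 40}, "dog"));
        w.close();

        // classify: OR together the features of the unseen example; the
        // label of the best-scoring hit is the predicted class
        int[] unseen = {12, 86, 3};
        BooleanQuery q = new BooleanQuery();
        for (int f : unseen) {
          q.add(new TermQuery(new Term("features", Integer.toString(f))),
                BooleanClause.Occur.SHOULD);
        }
        IndexSearcher searcher = new IndexSearcher(dir, true);
        ScoreDoc[] hits = searcher.search(q, 1).scoreDocs;
        if (hits.length > 0) {
          System.out.println("predicted: " + searcher.doc(hits[0].doc).get("label"));
        }
        searcher.close();
      }
    }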



- Original Message 
 From: Chris Spencer chriss...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Wed, April 6, 2011 7:46:45 PM
 Subject: Indexing Non-Textual Data
 
 Hi,
 
 I'm new to Lucene, so forgive me if this is a newbie question. I have a
 dataset composed of several thousand lists of 128 integer features, each
 list associated with a class label. Would it be possible to use Lucene as a
 classifier, by indexing the label with respect to these integer features,
 and then classify a new list by finding the most similar labels with Lucene?
 
 I'm specifically interested in doing so through the PyLucene API, so I've
 been going through the PyLucene samples, but they only seem to involve
 indexing text, not continuous features (understandably). Could anyone point
 me to an example that indexes non-textual data?
 
 I think the project Lire (http://www.semanticmetadata.net/lire/) is using
 Lucene to do something similar to this, although with an emphasis on image
 features. I've dug into their code a little, but I'm not a strong Java
 programmer, so I'm not sure how they're pulling it off, nor how I might
 translate this into the PyLucene API. In your opinion, is this a practical
 use of Lucene?
 
 Regards,
 Chris
 




Re: Detecting duplicates

2011-03-08 Thread Otis Gospodnetic
Mark,

Keep in mind that there are actually multiple patches for this.  SOLR-236 and 
SOLR-1086 IIRC.
Also, I just noticed this is java-user@lucene.  You may want to continue on 
solr-user@lucene.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Mark static.void@gmail.com
 To: java-user@lucene.apache.org
 Sent: Sat, March 5, 2011 8:35:13 PM
 Subject: Re: Detecting duplicates
 
 I'm familiar with Deduplication; however, I do not wish to remove my
 duplicates, and my needs are slightly different. I would like to mark the
 first document with signature 'xyz' as unique but the next one as a
 duplicate. This way I can filter out duplicates during searching using
 a filter query but still return the original document.
 
 The only thing I know of at the moment is to use field collapsing, but I
 tried the patch on 1.4.1 and it was terribly slow.
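
A hedged sketch of the index-time marking Mark describes; the field names and
the MD5 signature scheme are illustrative, not from the SOLR-236/SOLR-1086
patches:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DupMarkerSketch {
      private final Set<String> seen = new HashSet<String>();
      private final MessageDigest md5;

      public DupMarkerSketch() throws Exception {
        md5 = MessageDigest.getInstance("MD5");
      }

      // adds a 'dup' field: "false" for the first doc with a given text,
      // "true" for every later doc with the same signature
      public Document mark(String text) {
        String sig = new BigInteger(1, md5.digest(text.getBytes())).toString(16);
        boolean dup = !seen.add(sig);  // add() returns false if already present
        Document doc = new Document();
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("dup", Boolean.toString(dup), Field.Store.NO,
                          Field.Index.NOT_ANALYZED));
        return doc;
      }
    }

At search time a filter such as new QueryWrapperFilter(new TermQuery(new
Term("dup", "false"))) would then return only the originals.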
 
 On 3/5/11 4:43 AM, Grant Ingersoll wrote:
  See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy to
  pull out if you are doing just Lucene.
 
  On Mar 5, 2011, at 1:49 AM, Mark wrote:
 
   Is there a way one could detect duplicates (say by using some unique hash
   of certain fields) and mark a document as a duplicate but not remove it?
 
   Here is an example:
 
   Doc 1) This is my test
   Doc 2) This is my test
   Doc 3) Another test
   Doc 4) This is my test
 
   Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked
   as duplicates (of doc 1).
 
   Can this be easily accomplished?
 
  --
  Grant Ingersoll
  http://www.lucidimagination.com/
 
  Search the Lucene ecosystem docs using Solr/Lucene:
  http://www.lucidimagination.com/search



Re: Backup or replication option with lucene

2011-03-02 Thread Otis Gospodnetic
Hi Ganesh,

You could probably use replication scripts from Solr.
But why not just use Solr?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
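
If plain Lucene (no Solr) is a requirement, one option is
SnapshotDeletionPolicy, which pins a commit point while its files are copied;
a rough sketch with a made-up index path, error handling omitted:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.SnapshotDeletionPolicy;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BackupSketch {
      public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        SnapshotDeletionPolicy snapshotter =
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30), snapshotter,
            IndexWriter.MaxFieldLength.UNLIMITED);

        writer.commit();  // make sure there is a commit point to snapshot
        IndexCommit commit = snapshotter.snapshot();  // pin the current commit
        try {
          for (String fileName : commit.getFileNames()) {
            // copy each file to the backup location; files in a pinned
            // commit will not be deleted by merges while the snapshot is held
            System.out.println("would copy: " + fileName);
          }
        } finally {
          snapshotter.release();  // allow the writer to delete old files again
        }
        writer.close();
      }
    }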



- Original Message 
 From: Ganesh emailg...@yahoo.co.in
 To: java-user@lucene.apache.org
 Sent: Thu, March 3, 2011 12:03:20 AM
 Subject: Re: Backup or replication option with lucene
 
 Any suggestions? We are planning to move towards cloud and it has become a
 mandatory requirement to have backup or replication of the search db.
 
 Regards
 Ganesh
 
 - Original Message -
 From: Ganesh emailg...@yahoo.co.in
 To: java-user@lucene.apache.org
 Sent: Tuesday, March 01, 2011 12:06 PM
 Subject: Backup or replication option with lucene
 
 Hello all,
 
 Could any one guide me how to backup or do replication with Lucene.
 
 Regards
 Ganesh
 
 Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
 Download Now! http://messenger.yahoo.com/download.php



Re: Best practices for multiple languages?

2011-01-18 Thread Otis Gospodnetic
Hi Clemens,

If you will be searching individual languages, go with language-specific 
indices.  Wunder likes to give an example of die in German vs. English. :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
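
A rough sketch of the one-directory-per-language layout, with hypothetical
paths and one analyzer per language:

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PerLanguageIndexSketch {
      public static void main(String[] args) throws Exception {
        Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();
        analyzers.put("en", new StandardAnalyzer(Version.LUCENE_30));
        analyzers.put("de", new GermanAnalyzer(Version.LUCENE_30));

        Map<String, IndexWriter> writers = new HashMap<String, IndexWriter>();
        for (Map.Entry<String, Analyzer> e : analyzers.entrySet()) {
          IndexWriter w = new IndexWriter(
              FSDirectory.open(new File("indexes/" + e.getKey())),  // indexes/en, ...
              e.getValue(), IndexWriter.MaxFieldLength.UNLIMITED);
          writers.put(e.getKey(), w);
        }
        // route each document to writers.get(itsLanguage) at index time,
        // and search only the directory matching the query's language
        for (IndexWriter w : writers.values()) w.close();
      }
    }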



- Original Message 
 From: Clemens Wyss clemens...@mysign.ch
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Tue, January 18, 2011 12:53:57 PM
 Subject: Best practices for multiple languages?
 
 What is the best practice to support multiple languages, i.e.
 Lucene Documents that have multiple-language content/fields?
 
 Should
 a) each language be indexed in a separate index/directory, or should
 b) the Documents (in a single directory) hold the diverse localized fields?
 
 We most often will be searching language-dependent, which (at least
 performance-wise) mandates one-directory-per-language...
 
 Any (Lucene-specific) white papers on this topic?
 
 Thanks in advance
 Clemens



Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Otis Gospodnetic
[X] ASF Mirrors (linked in our release announcements or via the Lucene website)

[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)

[X] I/we build them from source via an SVN/Git checkout.

[] Other (someone in your company mirrors them internally or via a downstream
project)




Re: does lucene support Database full text search

2010-09-10 Thread Otis Gospodnetic
Hello,

You can use LuSQL to index DB content into Lucene.  Solr (the Lucene Server) 
has DataImportHandler for indexing data from DBs: 
http://search-lucene.com/?q=dataimporthandler

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
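
If you do drive Lucene directly instead of using LuSQL or DataImportHandler,
the core loop is small; a hedged sketch with a made-up table and connection
string:

    import java.io.File;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class DbIndexerSketch {
      public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/mydb", "user", "pass");  // hypothetical DB
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, body FROM articles");  // made-up table
        while (rs.next()) {
          Document doc = new Document();
          doc.add(new Field("id", rs.getString("id"),
                            Field.Store.YES, Field.Index.NOT_ANALYZED));
          doc.add(new Field("body", rs.getString("body"),
                            Field.Store.NO, Field.Index.ANALYZED));
          writer.addDocument(doc);
        }
        rs.close();
        st.close();
        conn.close();
        writer.close();  // commit the index
      }
    }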



- Original Message 
 From: yang Yang m4ecli...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Fri, September 10, 2010 9:38:58 AM
 Subject: does lucene support Database full text search
 
 Hi:
 I am using MySQL, and its full text search is rather weak.
 So I used Sphinx; however, I found it cannot support Chinese word
 searching perfectly.
 So I wonder if Lucene can work better?
 




Re: Calculate Term Co-occurrence Matrix

2010-08-21 Thread Otis Gospodnetic
Ahmed,

That's what that KPE (link in my previous email, below) will do for you.  It's 
not open source at this time, but that is exactly one of the things it does.  I 
think Mahout collocations stuff might work for you, too.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: ahmed algohary algoharya...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Sat, August 21, 2010 7:20:03 AM
 Subject: Re: Calculate Term Co-occurrence Matrix
 
 Thanks for all your answers!
 
 It seems like I did not make my question clear. I have a text corpus and I
 need to determine the pairs of words that occur together in many documents.
 I need to do that to be able to measure the semantic proximity between
 words. This method is explained here:
 http://forums.searchenginewatch.com/showthread.php?t=48
 I hope to find some code that, given a text corpus, generates all the word
 pairs with their probability of occurring together.
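
There is no ready-made co-occurrence API in Lucene itself, but with term
vectors enabled at indexing time a document-level pair count can be sketched
as follows (field name assumed; note the inner loop is quadratic in a
document's vocabulary, so this only suits short documents):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class CooccurrenceSketch {
      // counts, for every pair of terms, the number of docs containing both
      public static Map<String, Integer> countPairs(IndexReader reader)
          throws Exception {
        Map<String, Integer> pairCounts = new HashMap<String, Integer>();
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
          if (reader.isDeleted(docId)) continue;
          TermFreqVector tfv = reader.getTermFreqVector(docId, "text");
          if (tfv == null) continue;  // field must be indexed with TermVector.YES
          String[] terms = tfv.getTerms();
          for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length; j++) {
              String key = terms[i] + "\t" + terms[j];  // terms[] is sorted
              Integer c = pairCounts.get(key);
              pairCounts.put(key, c == null ? 1 : c + 1);
            }
          }
        }
        return pairCounts;  // divide by reader.numDocs() for a probability
      }
    }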
 
 
 On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 
  There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs,
  and a few other things:
  http://sematext.com/products/key-phrase-extractor/index.html
 
  One of the demos that uses news data is at
  http://sematext.com/demo/kpe/index.html
 
  Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 
  - Original Message 
   From: Grant Ingersoll gsing...@apache.org
   To: java-user@lucene.apache.org
   Sent: Fri, August 20, 2010 8:52:17 AM
   Subject: Re: Calculate Term Co-occurrence Matrix
  
   You might also be interested in Mahout's collocations package:
   http://cwiki.apache.org/confluence/display/MAHOUT/Collocations
  
   -Grant
   On Aug 19, 2010, at 11:39 AM, ahmed algohary wrote:
  
    Hi all,
  
    I need to know if there is a Lucene plug-in or a Lucene-based API for
    calculating the term co-occurrence matrix for a given text corpus.
  
    Thanks!
  
    --
    Ahmed
  
   --
   Grant Ingersoll
   http://www.lucidimagination.com/
  
   Search the Lucene ecosystem using Solr/Lucene:
   http://www.lucidimagination.com/search



Re: lucene indexing configuration

2010-08-20 Thread Otis Gospodnetic
Hi,

Are you actually talking about Solr?  Sounds like it.  Check solr-u...@lucene 
list.

Maybe you need to treat those words as protected words?  See the protwords.txt 
file in the conf dir.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
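
Assuming a Snowball stemmer is what folds met1/met2/met3 into met, the
relevant schema.xml fragment would look something like this (a guess at the
field type, with the gene names listed one per line in protwords.txt so they
bypass stemming):

    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- terms listed in protwords.txt (one per line) are not stemmed -->
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
    </fieldType>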



- Original Message 
 From: Shuai Weng sh...@genome.stanford.edu
 To: java-user@lucene.apache.org
 Sent: Fri, August 20, 2010 5:47:31 PM
 Subject: Re: lucene indexing configuration
 
 
 Hey,
 
 Currently we have indexed some biological full-text pages. I was wondering
 how to config the schema.xml such that the gene names 'met1', 'met2', 'met3'
 will be treated as different words. Currently they are all mapped to 'met'.
 
 Thanks,
 Shuai



Re: understanding lucene

2010-08-08 Thread Otis Gospodnetic
Manning, the Lucene in Action publisher, frequently offers 30-50% off on a 
number of their books, including LIA2.

See http://twitter.com/ManningBooks

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Yakob jacob...@opensuse-id.org
 To: java-user@lucene.apache.org
 Sent: Sun, August 8, 2010 5:54:38 AM
 Subject: Re: understanding lucene
 
 On 8/8/10, Uwe Schindler u...@thetaphi.de wrote:
  The example code you found is very old (seems to be from the Version 1.x of
  Lucene), and is not working with Version 2.x or 3.x of Lucene (the
  previously deprecated Hits class is gone in 3.0, static Field constructors
  were gone long ago in 2.0, so you get compilation errors).
 
  If you want to learn Lucene, buy the book Lucene in Action - 2nd Edition;
  there is everything explained and lots of examples for everyday use with
  the newest Version 3.0.2. See http://www.manning.com/hatcher2/ for ordering
  the PDF version or go to your local bookstore.
 
  In all cases, if you are new to Lucene don't use version 2.9.x or earlier;
  use 3.0.x with its clean API. This makes it easier for beginners.
 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 the ebook costs 30 dollars, can't I just get the free pirate version
 instead? hehe... I mean if you had the ebook yourself maybe you can
 email me the pdf version to my email here, so that it would not cost me
 money. :-)
 
 or maybe I can find it in rapidshare, maybe there is someone kind
 enough that put the book there.
 --
 http://jacobian.web.id



Re: Using categories with Lucene

2010-08-08 Thread Otis Gospodnetic
Hello Luan,

I think you are looking for facets and faceted search.  In short, it means 
storing the category for a document (web page) in a Document Field in the 
Lucene index.  Then, at search time, you count how many matches were in which 
category.  You can implement this yourself or you can use Solr, which has this 
functionality built-in.  If you want to stick with Lucene and don't want Solr, 
you can use Bobo Browse with Lucene - Lucene in Action 2 has a case study about 
Bobo Browse, where you can learn how this is done.  Slick stuff.

Thanks for using http://search-lucene.com :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
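
The Lucene-only core of the approach is small; a sketch with made-up field
names:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class CategorySketch {
      // at indexing time: store the page's category as an untokenized field
      static void addCategory(Document doc, String category) {
        doc.add(new Field("category", category, Field.Store.YES,
                          Field.Index.NOT_ANALYZED));
      }

      // at search time: restrict any user query to one category with a
      // filter; topDocs.totalHits is then the per-category match count
      static TopDocs searchInCategory(IndexSearcher searcher, Query userQuery,
                                      String category) throws Exception {
        Filter byCategory =
            new QueryWrapperFilter(new TermQuery(new Term("category", category)));
        return searcher.search(userQuery, byCategory, 10);
      }
    }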



- Original Message 
 From: Luan Cestari luan.cest...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Sun, August 8, 2010 7:16:05 PM
 Subject: Using categories with Lucene
 
 
 Lucene developers,
 
 We've been working on an undergraduate project at the college about changing
 Apache Nutch (which uses Lucene to index its web pages) to include a
 category filter, and we are having problems with the query part. We want to
 develop an application with good performance, so we thought that here
 would be the best place to ask this kind of question. The idea is that the
 user can search pages stored for only a category. So the number of results
 found should display the number of pages actually classified in that
 category.
 
 The problem is about how to add the category information to the Lucene
 indexes, and how to filter the search on that. We tried to look on the
 Nutch mailing-list (Nabble) about that and asked for some help, but people
 from there think that we should use some plug-in like Carrot, that gets like
 100 pages and classifies them at query time. We are not very confident that
 it's the best solution. We thought of two other different ideas: #1 To
 classify those pages, store that information in a DB, and at query
 time use that DB to filter the result. #2 Use different index
 servers, one for each category and one to search without filtering by
 category.
 
 We have seen on the project http://search-lucene.com/ that there are
 pre-defined categories. We think that this should be classified at indexing
 time, as we wanted.
 
 Do you have any other idea about how to do that?
 
 Sincerely,
 
 Daniel Costa Gimenes & Luan Cestari
 Undergraduate students of University Center of FEI
 Brazil
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Using-categories-with-Lucene-tp1049232p1049232.html
 
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)

2010-08-07 Thread Otis Gospodnetic
Utku, you should ask via comments on 
https://issues.apache.org/jira/browse/LUCENE-2453.
What happened with Lucandra?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Utku Can Topçu u...@topcu.gen.tr
 To: java-user@lucene.apache.org
 Sent: Fri, July 23, 2010 12:59:36 PM
 Subject: LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)
 
 Hi All,
 
 I'm trying to use the patch for testing, provided in the issue.
 
 I downloaded the patch and the dependency LUCENE-2453
 (https://issues.apache.org/jira/browse/LUCENE-2453).
 I tested this contribution against the r942817 revision, which I assume the
 contributor was using during the time of development. The tests seemed
 to fail.
 
 This time, I updated the CassandraDirectory.java to match the new Cassandra
 interface. It unfortunately failed again.
 
 Does anyone here have an idea on which cassandra revision and lucene
 revision this patch works against?
 
 Best Regards,
 Utku





Re: Personal Intro and a question on find top 10 similar items functionality

2010-07-08 Thread Otis Gospodnetic
Igor,

You can treat that question as the query and use it to search the index where 
you've indexed other questions.
More Like This is another option.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
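
For the More Like This route, a hedged sketch using contrib/queries, assuming
the archived questions were indexed into a "question" field:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.similar.MoreLikeThis;
    import org.apache.lucene.util.Version;

    public class SimilarQuestionsSketch {
      static ScoreDoc[] top10Similar(IndexReader reader, IndexSearcher searcher,
                                     String questionText) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_30));
        mlt.setFieldNames(new String[] { "question" });
        mlt.setMinTermFreq(1);  // questions are short, so keep thresholds low
        mlt.setMinDocFreq(2);
        Query query = mlt.like(new StringReader(questionText));
        return searcher.search(query, 10).scoreDocs;  // the 10 most similar
      }
    }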



- Original Message 
 From: Igor Chudov ichu...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, July 8, 2010 6:12:37 PM
 Subject: Personal Intro and a question on find top 10 similar items  
functionality
 
 Hello,
 
 My name is Igor and I own a website, algebra.com. I just joined.
 
 I have a database of answered algebra questions (208,000 and growing).
 
 A typical question is here (original spelling):
 
 ``who long does it take 2 people to finish painting a house if the
 first one takes 6 days and the second one takes 9 days''
 
 What I would like to do is, for anyone viewing an archived problem, to
 find the top 10 problems that would be most similar to the
 currently viewed query. Note that the meaning of similar is not defined in
 my question.
 
 Is Lucene even capable of this sort of thing?
 
 Could I expect reasonable performance (under 1-2 seconds) from it?
 
 thanks a bunch guys.
 
 i



Re: arguments in favour of lucene over commercial competition

2010-06-24 Thread Otis Gospodnetic
And I was just thinking the other day how it would be cool to take, say, Lucene 
1.4, then some 2.* version and now the latest 3.* version and compare. :)
Want to do it and share?  I don't think anyone has done this before.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: jm jmugur...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, June 24, 2010 3:50:58 AM
 Subject: Re: arguments in favour of lucene over commercial competition
 
 I want to add some perf numbers too, to show how it has improved in the
 last versions (not that it was bad before). Does anyone have a link
 to a nice page with numbers/graphs?
 
 On Thu, Jun 24, 2010 at 7:43 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Coincidentally, just after I replied to this thread I received an email
  from one of our customers.  In that email was a quote from one of the
  commercial search vendors.  My jaw didn't drop because I've seen similar
  numbers from other commercial search vendors before. I won't mention the
  customer nor the vendor, but I can tell you that the amount could put a
  couple of kids through a top-notch private college in the U.S.  Talking
  about TCO reduction through use of open-source!
  Otis

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Lucene/Solr choice typically means:

* lower cost of ownership (think about various crazy licensing models some of 
the commercial search vendors have: per doc, per server, per query, per 
year)

* faster implementation (just think about the duration of the sales/negotiation 
phase for commercial search vendors)

* flexibility -- it's open source, you can change whatever you want.  Try that 
with closed-source commercial search vendor's package.

* super fast and knowledgeable community  -- see 
http://www.jroller.com/otis/entry/lucene_solr_nutch_amazing_tech

* commercial support and experts still available -- see 
http://www.sematext.com/services/index.html

* adoption - small companies, medium companies, HUGE companies, secret 
organizations, everyone's using some form of Lucene -- see 
http://wiki.apache.org/lucene-java/PoweredBy , 
http://wiki.apache.org/solr/PublicServers

* maturity - Lucene is over 10 years old.  Solr is over 4 years old.

* future - look at JIRA, look at mailing list traffic, look at pace of 
development, look at CHANGES.txt

* searchable documentation and mailing list archives  -- 
http://search-lucene.com/


* ...

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: jm jmugur...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Wed, June 23, 2010 4:01:05 AM
 Subject: arguments in favour of lucene over commercial competition
 
 Hi,
 
 I am trying to compile some arguments in favour of lucene as
 management is deciding whether to standardize on lucene or a competing
 commercial product (we have a couple of products, one using lucene,
 another using a commercial product, imagine which one am I using). I
 searched the lists but could not find any post, although I remember seeing
 such posts in the past.
 
 Does somebody keep such posts linked or something? Or does someone
 know of some page that would help me?
 
 I would like to show:
 - traction of lucene, really improving a lot the last couple of years
 - rich ecosystem (solr...)
 - references of other companies choosing lucene/solr over commercial
 (be it Fast or whatever)
 
 thanks



Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Off the top of my head:

FAST
Endeca
Coveo
Attivio
Vivisimo
Google Search Appliance
(tell me when to stop)
Dieselpoint
IBM OmniFind
Exalead
Autonomy
dtSearch
ISYS
Oracle
...
...

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Hans Merkl hme...@rightonpoint.us
 To: java-user java-user@lucene.apache.org
 Sent: Wed, June 23, 2010 5:15:46 PM
 Subject: Re: arguments in favour of lucene over commercial competition
 
 Just curious. What commercial alternatives are out there?
 
 On Wed, Jun 23, 2010 at 04:01, jm jmugur...@gmail.com wrote:
 
  Hi,
 
  I am trying to compile some arguments in favour of lucene as
  management is deciding whether to standardize on lucene or a competing
  commercial product (we have a couple of products, one using lucene,
  another using a commercial product, imagine which one am I using). I
  searched the lists but could not find any post, although I remember
  seeing such posts in the past.
 
  Does somebody keep such posts linked or something? Or does someone
  know of some page that would help me?
 
  I would like to show:
  - traction of lucene, really improving a lot the last couple of years
  - rich ecosystem (solr...)
  - references of other companies choosing lucene/solr over commercial
  (be it Fast or whatever)
 
  thanks



Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
I won't comment on Attivio, as I think I might have signed some NDA with them.  
But they do claim to combine full-text search with DB-like joins.  Can't 
MarkLogic do that, too?


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Itamar Syn-Hershko ita...@code972.com
 To: java-user@lucene.apache.org
 Sent: Wed, June 23, 2010 5:54:34 PM
 Subject: RE: arguments in favour of lucene over commercial competition
 
 Otis, I'm 99% sure Attivio is just a wrapper around Lucene...
 
 And I personally wouldn't count full-text search solutions such as
 Oracle's.
 
 Itamar.
 
  -Original Message-
  From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
  Sent: Thursday, June 24, 2010 12:42 AM
  To: java-user@lucene.apache.org
  Subject: Re: arguments in favour of lucene over commercial competition
 
  Off the top of my head:
 
  FAST
  Endeca
  Coveo
  Attivio
  Vivisimo
  Google Search Appliance
  (tell me when to stop)
  Dieselpoint
  IBM OmniFind
  Exalead
  Autonomy
  dtSearch
  ISYS
  Oracle
  ...
  ...
 
   Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/



Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Coincidentally, just after I replied to this thread I received an email from 
one of our customers.  In that email was a quote from one of the commercial 
search vendors.  My jaw didn't drop because I've seen similar numbers from 
other commercial search vendors before. I won't mention the customer 
nor the vendor, but I can tell you that the amount could put a couple of kids 
through a top-notch private college in the U.S.  Talking about TCO reduction 
through use of open-source!
 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: jm jmugur...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Wed, June 23, 2010 5:57:32 PM
 Subject: Re: arguments in favour of lucene over commercial competition
 
 yes, in my case the competition is one of the list...
 
 On Wed, Jun 23, 2010 at 11:41 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Off the top of my head:
 
  FAST
  Endeca
  Coveo
  Attivio
  Vivisimo
  Google Search Appliance
  (tell me when to stop)
  Dieselpoint
  IBM OmniFind
  Exalead
  Autonomy
  dtSearch
  ISYS
  Oracle
  ...
  ...
 
   Otis



Re: Monitoring low level IO

2010-06-04 Thread Otis Gospodnetic
Ah, there is another one I came across several months back - 
http://wiki.sdn.sap.com/wiki/display/Java/JPicus.

 
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Otis Gospodnetic otis_gospodne...@yahoo.com
 To: java-user@lucene.apache.org
 Sent: Fri, June 4, 2010 1:54:15 AM
 Subject: Re: Monitoring low level IO
 
 Other than iostat, vmstat and such?
 Otis
 
 - Original Message 
  From: Jason Rutherglen jason.rutherg...@gmail.com
  To: java-user@lucene.apache.org
  Sent: Thu, June 3, 2010 2:13:17 PM
  Subject: Monitoring low level IO
 
  This is more of a unix-related question than Lucene-specific;
  however, because Lucene is being used, I'm asking here as perhaps
  other people have run into a similar issue.
 
  On Amazon EC2, merge, read, and write operations are possibly
  blocking due to underlying IO. Is there a tool that you have
  used to monitor this type of thing?



Re: is there any resources that explain detailed implementation of lucene?

2010-06-03 Thread Otis Gospodnetic
Li Li:

Then best to go to the source.
Here's one version with syntax highlighting and line numbers, should you have 
questions about specific parts of that class:

http://search-lucene.com/c/Lucene:/src/java/org/apache/lucene/search/PhraseQuery.java

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
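
As a starting point before reading that source, a tiny sketch of
PhraseQuery's public surface - the part an extension would typically build
on:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class PhraseQuerySketch {
      public static void main(String[] args) {
        // matches "quick fox" with up to 2 intervening position moves,
        // e.g. "quick brown fox"; scoring favors closer matches
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("body", "quick"));
        pq.add(new Term("body", "fox"));
        pq.setSlop(2);
        System.out.println(pq);  // body:"quick fox"~2
      }
    }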



- Original Message 
 From: Li Li fancye...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, June 3, 2010 2:51:02 AM
 Subject: Re: is there any resources that explain detailed implementation of
 lucene?
 
 e.g. I want to know the code under phrase query so that I can make
 some extension.
 
 2010/6/3 Erick Erickson erickerick...@gmail.com:
  Why do you care (tm)? Or, put another way, are you asking just for
  general understanding of how Lucene works or is there a higher-level
  problem you're trying to solve?
 
  Best
  Erick
 
  On Wed, Jun 2, 2010 at 8:54 PM, Li Li fancye...@gmail.com wrote:
 
   such as the detailed processes of storing data structures, indexing,
   searching and sorting, not just APIs. thanks.



Re: numDeletedDocs()

2010-06-03 Thread Otis Gospodnetic
Btw. folks, http://search-lucene.com/ has a really handy source code search 
with auto-completion for Lucene, Solr, etc.  For example, I typed in: numDel  - 
and immediately found those methods.  Use it. :)

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: java-user@lucene.apache.org
 Sent: Thu, June 3, 2010 4:02:09 PM
 Subject: Re: numDeletedDocs()
 
 Hmm... I don't think IndexWriter has ever had a numDeletedDocs() (w/ no
 params)?
 
 (IndexReader does).
 
 Mike
 
 On Thu, Jun 3, 2010 at 3:50 PM, Woolf, Ross ross_wo...@bmc.com wrote:
  There seems to be a mismatch between the IndexWriter().numDeletedDocs()
  method as stated in the javadocs supplied in the 2.9.2 download and what
  is actual.
 
  JavaDocs for 2.9.2 as came with the 2.9.2 download
 
  numDeletedDocs
  public int numDeletedDocs()  Returns the number of deleted documents.
  (No parameter required)
 
  --
  Source code for 2.9.2
   public int numDeletedDocs(SegmentInfo info) throws IOException {
  (Parameter required)
 
  Why is there no longer a no-parameter numDeletedDocs as stated in the
  JavaDocs?  I'm not sure how I use the experimental SegmentInfo just to
  get the delete count in my index?  Any help appreciated.
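
For the underlying need - just getting the deleted-doc count - a sketch that
stays on the IndexReader side and avoids SegmentInfo entirely (the index path
is made up):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class DeletedCountSketch {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/path/to/index")), true);  // read-only
        // maxDoc() counts deleted docs, numDocs() does not;
        // reader.numDeletedDocs() returns the same difference directly
        System.out.println("deleted docs: " + (reader.maxDoc() - reader.numDocs()));
        reader.close();
      }
    }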



Re: Monitoring low level IO

2010-06-03 Thread Otis Gospodnetic
Other than iostat, vmstat and such?
 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
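
For reference, typical invocations of those standard Linux tools:

    iostat -x 2   # extended per-device stats (await, %util) every 2 seconds
    vmstat 2      # memory, swap, and the 'wa' (IO-wait) CPU column every 2 seconds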



- Original Message 
 From: Jason Rutherglen jason.rutherg...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, June 3, 2010 2:13:17 PM
 Subject: Monitoring low level IO
 
 This is more of a unix-related question than Lucene-specific;
 however, because Lucene is being used, I'm asking here as perhaps
 other people have run into a similar issue.
 
 On Amazon EC2, merge, read, and write operations are possibly
 blocking due to underlying IO. Is there a tool that you have
 used to monitor this type of thing?



Re: Wich way would you recommend for successive-words similarity and scoring ?

2010-06-01 Thread Otis Gospodnetic
Hi Pablo,

This question comes up every once in a while.  You'll find some previous 
discussions and answers here: 
http://search-lucene.com/?q=terms+closer+together+score

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
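
The recipe that usually comes out of those threads, sketched: require the
terms as usual, then add an optional sloppy PhraseQuery so documents where
the terms sit closer together score higher (field name and boost are
illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class ProximityBoostSketch {
      static BooleanQuery proximityBoosted(String field, String... words) {
        BooleanQuery bq = new BooleanQuery();
        PhraseQuery near = new PhraseQuery();
        for (String w : words) {
          bq.add(new TermQuery(new Term(field, w)), BooleanClause.Occur.MUST);
          near.add(new Term(field, w));
        }
        near.setSlop(50);     // sloppy phrase: closer terms -> higher score
        near.setBoost(5.0f);  // weight the proximity component
        bq.add(near, BooleanClause.Occur.SHOULD);  // optional, only affects score
        return bq;
      }
    }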



- Original Message 
 From: Pablo pablo.queixa...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Mon, May 3, 2010 3:20:10 PM
 Subject: Wich way would you recommend for successive-words similarity and
 scoring ?
 
 Hello,
 
 Lucene core doesn't seem to use relative word positioning (?) for scoring.
 
 For example, indexing the phrase a b c d e f g h i j k l m n o p q r
 s t u v w x y z, these queries give the same results (0.19308087):
  - 1 : phrase:'e f g'
  - 2 : phrase:'o k z'
 
 I'm a bit familiar with lucene and snowballs, but I never (really)
 needed this feature before, and didn't browse the lucene contribs.
 
 Maybe I'm misunderstanding something, but what can I do to make
 query 1 get a better score than the second?
 
 Should I implement a Scorer and/or a Similarity, or can an analyser
 and a specific stemmer be sufficient?
 
 Thanks. [I first wrote to dev, wasn't the right place.]
 
 Pablo




Re: Grouping or de-duping

2010-05-31 Thread Otis Gospodnetic
Pasa,

Maybe Field Collapsing (Solr) can help? See SOLR-236 in JIRA

http://search-lucene.com/?q=field+collapsing&fc_project=Lucene&fc_project=Solr

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Паша Минченков char...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Mon, May 31, 2010 4:15:40 PM
 Subject: Grouping or de-duping
 
 Sorry for my similar questions. I need to remove duplicates from search
 results for a given field (or group by). Documents on this field are not
 ordered. Which one ends up in the search results as the kept duplicate - I
 do not care. I tried to use DuplicateFilter and PerParentLimitedQuery, but
 they didn't help. In searching for an answer I found references to
 SimpleFacetParameters, but I do not understand how this material can be
 useful to me because it refers to the project Solr. Maybe someone has an
 example of grouping search results or something like DeDupinQuery.
 
 On the link below, I found a solution, but there is no sample and I can't
 make these modifications myself.
 
 http://markmail.org/message/uvrh3y5ogjgu4gfx#query:group%20lucene%20results%20by%20field+page:1+mid:uvrh3y5ogjgu4gfx+state:results
 
 Thanks.




Re: Is Lucene a document oriented database?

2010-05-31 Thread Otis Gospodnetic
I think those doc-oriented DBs tend to be distributed, with replication 
built-in and such, but yes, in some ways the schemaless DB with docs and fields 
(whether they are pumped in as JSON or XML or Java objects) feels the same.  I 
saw something from Grant about 2 months ago about how Lucene is nosql-ish.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Shashi Kant sk...@sloan.mit.edu
 To: java-user@lucene.apache.org
 Sent: Mon, May 31, 2010 12:20:36 PM
 Subject: Is Lucene a document oriented database?
 
 There seems to be considerable buzz on the internets about document-oriented 
 DBs such as MongoDB, CouchDB etc. I am at a loss as to what the principal 
 differences between Lucene and the DODBs are. I could very well use Lucene 
 as any of the above (schema-free, document-oriented) and perform similar 
 queries, *with* the added benefit of text search.
 
 I fail to see what benefits such DODBs bring, or is it old wine in new 
 bottles?
 
 Thanks
 Shashi


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using JSON for index input and search output

2010-05-31 Thread Otis Gospodnetic
VL,

Solr (not Lucene, but you can embed Solr) has JsonUpdateRequestHandler, which 
lets you send docs to Solr for indexing in JSON (instead of the usual XML):
http://search-lucene.com/c/Solr:/src/java/org/apache/solr/handler/JsonUpdateRequestHandler.java


And you can get Solr to respond with JSON, as you pointed out:
http://wiki.apache.org/solr/SolJSON
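
P.S. For the non-embedded case, a minimal sketch of posting a JSON doc to that handler with nothing but the JDK; the URL, handler path, and field names are assumptions for illustration:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class JsonPostSketch {
    public static void main(String[] args) throws Exception {
        // assumes a Solr instance with /update/json mapped to JsonUpdateRequestHandler
        URL url = new URL("http://localhost:8983/solr/update/json?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        OutputStream out = conn.getOutputStream();
        out.write("[{\"id\":\"1\",\"title\":\"hello json\"}]".getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode()); // 200 on success
    }
}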

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Visual Logic visual.lo...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Sun, May 30, 2010 1:33:19 PM
 Subject: Using JSON for index input and search output
 
 Lucene,
 
 JSON is the format used for all the configuration and property files in the 
 RIA application we are developing. Is Lucene able to create a document from a 
 given JSON file and index it? Is Lucene able to provide a JSON output response 
 from a query made to an index? Does the Tika package provide this?
 
 Local indexing and searching is needed on the local client, so Solr is not a 
 solution, even though it does provide a search response in JSON format.
 
 VL

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: TermsFilter instead of should TermQueries

2010-05-09 Thread Otis Gospodnetic
I think what Tomislav was trying to ask is:

Can filters replace only strictly boolean clauses (i.e., only MUST and 
MUST_NOT), such as +gender:F or -rating:xxx?
Or can filters also replace SHOULD clauses, such as food:banana (which is 
neither absolutely required nor strictly prohibited)?

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
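
P.S. To make the distinction concrete, here is a sketch of the first case: the strictly boolean clause moved into a (cacheable) filter, while the SHOULD clause stays a scoring query. Field names are taken from the examples above; this is illustrative, not the only way to do it:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class FilterVsShouldSketch {
    // +gender:F becomes a (cacheable) filter: it restricts matches, no longer scores
    static final Filter GENDER_F = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("gender", "F"))));

    static TopDocs search(Searcher searcher) throws java.io.IOException {
        // food:banana stays a SHOULD clause so it still contributes to the score
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("food", "banana")), BooleanClause.Occur.SHOULD);
        return searcher.search(query, GENDER_F, 10);
    }
}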



- Original Message 
 From: Erick Erickson erickerick...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Fri, May 7, 2010 8:30:18 PM
 Subject: Re: TermsFilter instead of should TermQueries
 
 Well, you construct the filter by enumerating the terms you're interested in 
 and pass it along to the relevant search.
 
 But it looks like you've figured that part out. If you're asking how you can 
 use a Filter and still have the terms replaced by the filter contribute to 
 scoring, you can't. But it's a reasonable question to ask whether it changes 
 the score enough to matter, given that this is only a problem when there are 
 many terms.
 
 If this doesn't speak to your question, can you ask for more detail?
 
 HTH
 Erick
 
 On Fri, May 7, 2010 at 1:19 PM, Tomislav Poljak tpol...@gmail.com wrote:
 
  Hi,
  in the API documentation for TermsFilter:
 
  http://search-lucene.com/jd/lucene/org/apache/lucene/search/TermsFilter.html
 
  it states:
 
  'As a filter, this is much faster than the equivalent query (a
  BooleanQuery with many should TermQueries)'
 
  I would like to replace should TermQueries with a TermsFilter to benefit
  in performance, but I'm trying to understand how this change/switch can
  work.
 
  I was under the impression that a BooleanQuery with many should
  TermQueries affects scoring like this: each should term present in a
  result increases the result's score.
 
  If someone could explain how a TermsFilter (which, like any filter, is
  a binary thing - a result document is matched or not) can be used to
  replace should clauses, I would really appreciate it.
 
  Tomislav


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Filter vs. TermQuery performance

2010-05-09 Thread Otis Gospodnetic
I think others will have more thoughts on this, esp. for Numeric* questions... 
but I'll try answering...
 

- Original Message 
 From: Tomislav Poljak tpol...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Fri, May 7, 2010 2:34:46 PM
 Subject: Filter vs. TermQuery performance
 
 Hi,
 when is it wise to replace a TermQuery with a cached Filter 
 (regarding search performance)? If a TermQuery is used only to filter results 
 based on a field value (it doesn't participate in scoring), is it always wise 
 to replace it with a filter? 

Yes, assuming the filter will be reused.  I think there is not a lot of value 
in using a filter (vs. just a regular query) if that filter will not be reused. 
This is why in Solr fq's (filter queries) are cached in a special filter 
cache.  I *think* the only other benefit of using a filter vs., say, a 
TermQuery, is that the former will not spend any time/CPU on computing the 
score for the filter part.
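
To illustrate the reuse point, a sketch (the field name is an assumption; the key is that the filter instance is built once and shared across searches):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class FilterReuseSketch {
    // built once and shared; creating a fresh CachingWrapperFilter per request
    // would defeat the cache entirely
    static final Filter RATING_PG = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("rating", "pg"))));

    static TopDocs search(Searcher searcher, Query query) throws java.io.IOException {
        // the first search per index reader fills the cached bit set;
        // subsequent searches reuse it and skip the term scan
        return searcher.search(query, RATING_PG, 10);
    }
}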

 Is it only wise if Filter is cached (wrapped in CachingWrapperFilter) and 
 reused often?

I think so.  See above.

 Does it matter how many 
 distinct values field has (which is related to how many matches/results for 
 one given/selected value is returned and also with how many times same filter 
 instance is reused)?

I *think* it matters.  I think the more docs a filter matches, the higher the 
benefit from reusing a filter.

 For example, what if a filter for a single value matches 
 only 5% of docs - should a filter be used, or is it better to use a TermQuery? 
 What about if a filter for a single value matches 20%? Or 50%? Or 75%?

I'm not sure...

 I have a question regarding caching performance/memory usage. 
 Documents have datetime indexed (as NumericField) with minute resolution 
 and there are few thousands unique datetime in index. On the search 
 side open ended range filter is used (NumericRangeFilter) with current 
 time as a parameter.

 Now, is it wise to cache NumericRangeFilter here 
 (reuse instance of CachingWrapperFilter wrapping NumericRangeFilter) since it 
 will not be reused often (only from users searching at same time in same time 
 zone)?

If the cache hit rate is low, why waste memory on caching - that is the logic I 
would apply here.
If you have 3 queries, and each uses a different date range, then you 
will not see benefits from caching.
If 2 of those 3 queries use the exact same date range, then you will see 
caching benefits.

 Is it better to use NumericRangeFilter or NumericRangeQuery in this case?

I'm not sure, but I'd be happy to add specific advice to Javadoc when the 
answer is clear.

Otis

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucandra - Lucene/Solr on Cassandra: April 26, NYC

2010-04-22 Thread Otis Gospodnetic
Hello folks,

Those of you in or near NYC and using Lucene or Solr should come to Lucandra - 
a Cassandra-based backend for Lucene and Solr on April 26th:

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/

The presenter will be Lucandra's author, Jake Luciani.

Please spread the word.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Range Query Assistance

2010-04-21 Thread Otis Gospodnetic
Joseph,

If you can, get the latest Lucene and use NumericField to index your dates with 
appropriate precision and then use NumericRangeQueries when searching.  This 
will be faster than searching for string dates in a given range.
 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
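
P.S. A small sketch of both sides; the field name and the days-since-epoch encoding are illustrative assumptions:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

public class DateRangeSketch {
    // indexing side: encode the date as a number, e.g. days since epoch
    static Document makeDoc(long dateMillis) {
        Document doc = new Document();
        doc.add(new NumericField("date").setLongValue(dateMillis / 86400000L));
        return doc;
    }

    // search side: [start TO end] becomes a trie-encoded numeric range
    static NumericRangeQuery makeQuery(long startDay, long endDay) {
        return NumericRangeQuery.newLongRange("date", startDay, endDay, true, true);
    }
}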



- Original Message 
 From: i...@josephcrawford.com i...@josephcrawford.com
 To: java-user@lucene.apache.org
 Sent: Fri, April 16, 2010 9:23:30 AM
 Subject: Range Query Assistance
 
 Hello,
 
 I would like to query based on a start and end date.  I was thinking 
 something like this:
 
 start_date: [2101 TO todays date] end_date: [todays date TO 20900101]
 
 Would this work for me?  Our dates are stored in the index as strings, so I 
 am not sure the syntax above would be correct.
 
 Any assistance would be appreciated.
 
 Thanks,
 Joseph Crawford


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NumericField indexing performance

2010-04-15 Thread Otis Gospodnetic
Hi,

I actually don't follow your change: after the "but changing it to" line, the 
only different thing I see is the doc.add(dateField) call, which you didn't 
list before "but changing it to".

Also, if I understood Uwe correctly, he was suggesting reusing NumericField 
instances, which means new NumericField("date") should be called only *once* 
in your code.  The same goes for Document instances.  GC threads will thank 
you and Uwe for this change.
 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
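
P.S. In other words, a sketch of the intended pattern: one Document and one NumericField, created and added once, with only the value changing per document (the method and value source are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;

public class ReuseSketch {
    static void indexAll(IndexWriter writer, long[] values) throws java.io.IOException {
        Document doc = new Document();                      // created once
        NumericField dateField = new NumericField("date");  // created once...
        doc.add(dateField);                                 // ...and added once
        for (long value : values) {
            dateField.setLongValue(value);  // only the value changes per document
            writer.addDocument(doc);        // same Document instance every time
        }
    }
}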



- Original Message 
 From: Tomislav Poljak tpol...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, April 15, 2010 7:41:02 AM
 Subject: RE: NumericField indexing performance
 
 Hi Uwe,
 thank you very much for your answers. I've done Document and NumericField 
 reuse like this:
 
 Document doc = getDocument();
 NumericField dateField = new NumericField("date");
 
 for each doc:
 
   doc.add(dateField.setLongValue(Long.parseLong(DateTools.dateToString(date, DateTools.Resolution.MINUTE))));
 
 but changing it to:
 
 Document doc = getDocument();
 NumericField dateField = new NumericField("date");
 doc.add(dateField);
 
 for each doc:
 
   dateField.setLongValue(Long.parseLong(DateTools.dateToString(date, DateTools.Resolution.MINUTE)));
 
 did the trick. Now indexing with NumericField takes minutes, not hours.
 
 Thanks again,
 
 Tomislav





On Wed, 2010-04-14 at 23:38 +0200, Uwe Schindler wrote:
 One addition:
 If you are indexing millions of numeric fields, you should also try to reuse 
 NumericField and Document instances (as described in the JavaDocs). 
 NumericField creates internally a NumericTokenStream and lots of small 
 objects (attributes), so GC cost may be high. This is just another idea.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
  -Original Message-
  From: Uwe Schindler [mailto:u...@thetaphi.de]
  Sent: Wednesday, April 14, 2010 11:28 PM
  To: java-user@lucene.apache.org
  Subject: RE: NumericField indexing performance
  
  Hi Tomislav,
  
  Indexing with NumericField takes longer (at least for the default 
  precision step of 4, which means that out of a 32-bit integer it makes 8 
  subterms, each covering 4 bits of the value). So you produce 8 times more 
  terms during indexing that must be handled by the indexer. If you have 
  lots of documents with distinct values, the term index gets larger and 
  larger, but search performance increases dramatically (for 
  NumericRangeQueries). So if you index *only* numeric fields and nothing 
  else, 8-times-slower indexing can be true.
  
  If you are not using NumericRangeQuery or you want to tune indexing 
  performance, try larger precision steps like 6 or 8. If you don't use 
  NumericRangeQuery and only want to index the numeric terms as *one* term, 
  use precStep=Integer.MAX_VALUE. Also check your memory requirements, as 
  the indexer may need more memory and GC may cost too much. The index size 
  will also increase, so lots more I/O is done. Without more details I 
  cannot say anything about your configuration. So please tell us: how many 
  documents, how many fields, and how many numeric fields, in which 
  configuration, do you use?
  
  Uwe
  
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
  
   -Original Message-
   From: Tomislav Poljak [mailto:tpol...@gmail.com]
   Sent: Wednesday, April 14, 2010 8:13 PM
   To: java-user@lucene.apache.org
   Subject: NumericField indexing performance
  
   Hi,
   is it normal for indexing time to increase up to 10 times after 
   introducing NumericField instead of Field (for two fields)?
  
   I've changed two date fields from String representation (Field) to 
   NumericField; now it is:
  
   doc.add(new NumericField("time").setIntValue(date.getTime()/24/3600))
  
   and after this change indexing took 10x more time (before it was a few 
   minutes, and after, more than an hour and a half). I've tested with a 
   simple counter like this:
  
   doc.add(new NumericField("endTime").setIntValue(count++))
  
   but nothing changed, it still takes around 10x longer. If I comment out 
   adding one numeric field to the index, indexing time drops 
   significantly, and if I comment out both fields, indexing takes only a 
   few minutes again.
  
   Tomislav

Re: Searching Subversion comments:

2010-03-08 Thread Otis Gospodnetic
Hi Erick,

For what it's worth, we are considering indexing JIRA comments over on 
http://search-lucene.com/ , though I'm not entirely convinced searching in 
comments would be super valuable.  Would it?

But note that JIRA (and LucidFind) already do that.  For example, go to 
http://issues.apache.org/jira/browse/LUCENE-2061 and search for "Attached first 
cut python script nrtBench.py.~10" (it's in that issue's comments) and JIRA 
will find that issue.

What exactly are you looking to do/build?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Erick Erickson erickerick...@gmail.com
 To: java-user java-user@lucene.apache.org
 Sent: Mon, March 8, 2010 3:48:41 PM
 Subject: Searching Subversion comments:
 
 Before I reinvent the wheel.
 
 Is there any convenient way to, say, find all the files associated with
 patch ? I realize one can (hopefully) get this information from JIRA,
 but... This is a subset of the problem of searching Subversion comments.
 
 I can see it being useful, especially for people coming into the code fresh.
 Grep (or the equivalent in the IDE) only goes so far. If there's any
 interest, I'm thinking of playing with http://svn-search.sourceforge.net/ to
 see what I could see and report back. It should be easy enough to set up on
 my machine at home, although I'm not set up to show it to others.
 
 And it's even based on Lucene. This is feeling recursive..
 
 Mostly I'm checking to see if something like this has already been done and
 I just missed the boat. Besides, I'm curious...
 
 Erick


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: OutOfMemoryError

2010-03-05 Thread Otis Gospodnetic
Maybe it's not a leak, Monique. :)
If you use sorting in Lucene, then the FieldCache object will keep some data 
permanently in memory, for example.


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Monique Monteiro monique.lou...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Fri, March 5, 2010 1:38:31 PM
 Subject: OutOfMemoryError
 
 Hi all,
 
 
 
   I'm new to Lucene and I'm evaluating it in a web application which looks
 up strings in a huge index - the index file contains 32GB. I keep a
 reference to a Searcher object during the application's lifetime, but this
 object has strong memory requirements and keeps memory consumption around
 950MB.  I did some optimization in order to share some fields in two
 "composed" indices, but in a web application with less than 1GB for the JVM,
 OutOfMemoryError is generated. It seems that the searcher keeps some form of
 cache which is not frequently released.
 
 
   I'd like to know if this kind of memory leak is normal according to
 Lucene's behaviour and if the only available solution is adding memory to
 the JVM.
 
 Thanks in advance!
 
 -- 
 Monique Monteiro, MSc
 IBM OOAD / SCJP / MCTS Web
 Blog: http://moniquelouise.spaces.live.com/
 Twitter: http://twitter.com/monilouise
 MSN: monique_lou...@msn.com
 GTalk: monique.lou...@gmail.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more

2010-03-03 Thread Otis Gospodnetic
Hello folks,

Those of you in or near New York and using Lucene or Solr should come to 
Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more on March 
24th:

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12720960/


The presenter will be the hyper active Lucene committer Robert Muir.

Please spread the word.

Otis
--
Lucene ecosystem search :: http://search-lucene.com/


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Email Filter using Lucene 3.0

2010-01-29 Thread Otis Gospodnetic
Hi Jamie,

Could you say more about how it's not working?  Not compiling?  Run-time 
exceptions?  Doesn't work as expected when you run a unit test for it?


Otis 
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Jamie ja...@stimulussoft.com
 To: java-user@lucene.apache.org
 Sent: Fri, January 29, 2010 7:29:13 AM
 Subject: Email Filter using Lucene 3.0
 
 Hi THere
 
 In the absence of documentation, I am trying to convert an EmailFilter class to 
 Lucene 3.0. It's not working! Obviously, my understanding of the new token 
 filter mechanism is misguided.
 Can someone in the know help me out for a sec and let me know where I am going 
 wrong? Thanks.
 
 import org.apache.commons.logging.*;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.Stack;
 
 /* Many thanks to Michael J. Prichard for his
  * original email filter code. It is rewritten. */
 
 public class EmailFilter extends TokenFilter implements Serializable {
 
     public EmailFilter(TokenStream in) {
         super(in);
     }
 
     public final boolean incrementToken() throws java.io.IOException {
 
         if (!input.incrementToken()) {
             return false;
         }
 
         TermAttribute termAtt = (TermAttribute) input.getAttribute(TermAttribute.class);
 
         char[] buffer = termAtt.termBuffer();
         final int bufferLength = termAtt.termLength();
         String emailAddress = new String(buffer, 0, bufferLength);
         emailAddress = emailAddress.replaceAll("<", "");
         emailAddress = emailAddress.replaceAll(">", "");
         emailAddress = emailAddress.replaceAll("\"", "");
 
         String[] parts = extractEmailParts(emailAddress);
         clearAttributes();
         for (int i = 0; i < parts.length; i++) {
             if (parts[i] != null) {
                 TermAttribute newTermAttribute = addAttribute(TermAttribute.class);
                 newTermAttribute.setTermBuffer(parts[i]);
                 newTermAttribute.setTermLength(parts[i].length());
             }
         }
         return true;
     }
 
     private String[] extractWhitespaceParts(String email) {
         String[] whitespaceParts = email.split(" ");
         ArrayList<String> partsList = new ArrayList<String>();
         for (int i = 0; i < whitespaceParts.length; i++) {
             partsList.add(whitespaceParts[i]);
         }
         return whitespaceParts;
     }
 
     private String[] extractEmailParts(String email) {
 
         if (email.indexOf('@') == -1)
             return extractWhitespaceParts(email);
 
         ArrayList<String> partsList = new ArrayList<String>();
 
         String[] whitespaceParts = extractWhitespaceParts(email);
 
         for (int w = 0; w < whitespaceParts.length; w++) {
             if (whitespaceParts[w].indexOf('@') == -1)
                 partsList.add(whitespaceParts[w]);
             else {
                 partsList.add(whitespaceParts[w]);
                 String[] splitOnAmpersand = whitespaceParts[w].split("@");
                 try {
                     partsList.add(splitOnAmpersand[0]);
                     partsList.add(splitOnAmpersand[1]);
                 } catch (ArrayIndexOutOfBoundsException ae) {}
 
                 if (splitOnAmpersand.length > 0) {
                     String[] splitOnDot = splitOnAmpersand[0].split("\\.");
                     for (int i = 0; i < splitOnDot.length; i++) {
                         partsList.add(splitOnDot[i]);
                     }
                 }
                 if (splitOnAmpersand.length > 1) {
                     String[] splitOnDot = splitOnAmpersand[1].split("\\.");
                     for (int i = 0; i < splitOnDot.length; i++) {
                         partsList.add(splitOnDot[i]);
                     }
 
                     if (splitOnDot.length > 2) {
                         String domain = splitOnDot[splitOnDot.length - 2] + "." 
                             + splitOnDot[splitOnDot.length - 1];
                         partsList.add(domain);
                     }
                 }
             }
         }
         return partsList.toArray(new String[0]);
     }
 
 }


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: index demo throws LockObtainFailedException

2010-01-28 Thread Otis Gospodnetic
Fedora Core 4 is *ancient*! :)
Could it be that the NFS client on it is old, and this is causing problems?  I 
remember emails about NFS 3 vs. NFS 4 and some improvements in the latter.  I 
don't recall the details and tend to keep my Lucene and Solr instances away 
from NFS mounts.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Teruhiko Kurosaka k...@basistech.com
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Thu, January 28, 2010 8:15:26 PM
 Subject: index demo throws LockObtainFailedException
 
 We have many Linux machines of different brands, sharing the same NFS 
 filesystem
 for home.  The Lucene file indexing demo program is failing with 
 LockObainFailedException 
 only on one particular Linux machine (Fedora Core 4, x86).  I am including
 the console output at the bottom of this message.
 
 I tried Lucene 2.9.0, 2.9.1 and 3.0.0, and the result is identical.
 
 After searching the Internet, I saw some postings suggesting that this happens
 when the disk space is low. But there seems to be more than enough for this
 small demo.  I didn't understand the suggestions about lockd.  I'd appreciate
 any advice on how to find the cause of this Exception. 
 
 Thank you in advance.
 
 T. Kuro Kurosaka
 
 -bash-3.00$ cd lucene-3.0.0/
 -bash-3.00$ ant demo-index-text
 Buildfile: build.xml
 
 jar.core-check:
 
 compile-demo:
 [mkdir] Created dir: /basis/users/kuro/opt/lucene-3.0.0/build/classes/demo
 [javac] Compiling 17 source files to 
 /basis/users/kuro/opt/lucene-3.0.0/build/classes/demo
 
 jar-demo:
   [jar] Building jar: 
 /basis/users/kuro/opt/lucene-3.0.0/lucene-demos-3.0.0.jar
 
 demo-index-text:
  [echo] - (1) Prepare dir -
  [echo] cd /basis/users/kuro/opt/lucene-3.0.0
  [echo] rmdir demo-text-dir
  [echo] mkdir demo-text-dir
 [mkdir] Created dir: /basis/users/kuro/opt/lucene-3.0.0/demo-text-dir
  [echo] cd demo-text-dir
  [echo] - (2) Index the files located under 
 /basis/users/kuro/opt/lucene-3.0.0/src -
  [echo] java -classpath 
 ../lucene-core-3.0.0.jar;../lucene-demos-3.0.0.jar 
 org.apache.lucene.demo.IndexFiles ../src/demo
  [java]  caught a class org.apache.lucene.store.LockObtainFailedException
  [java]  with message: Lock obtain timed out: 
 NativeFSLock@/basis/users/kuro/opt/lucene-3.0.0/demo-text-dir/index/write.lock:
  
 java.io.IOException: Input/output error
 
 BUILD SUCCESSFUL
 Total time: 6 seconds
 -bash-3.00$ df -k . /tmp
 Filesystem   1K-blocks  Used Available Use% Mounted on
 storev:/vol/exports/users
  3119362560 2790661520 328701040  90% /basis/users
 /dev/sda2  9718360   7700764   1515968  84% /
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Proximity of More than Single Words?

2010-01-21 Thread Otis Gospodnetic
Yes, that's just phrase slop, allowing for variable gaps between words.
I *believe* the Surround QueryParser, which works with the Span family of 
queries, does handle what you are looking for.


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
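
P.S. For reference, a sketch of how the Span family can express "spot prices" w/3 "gulf coast" directly (the field name is an assumption):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class PhraseProximitySketch {
    static SpanQuery phrase(String a, String b) {
        return new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("body", a)),
            new SpanTermQuery(new Term("body", b)) }, 0, true);  // exact two-word phrase
    }

    static SpanQuery spotPricesNearGulfCoast() {
        // the two phrases within 3 positions of each other, in either order
        return new SpanNearQuery(new SpanQuery[] {
            phrase("spot", "prices"), phrase("gulf", "coast") }, 3, false);
    }
}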



- Original Message 
 From: T. R. Halvorson t...@midrivers.com
 To: java-user@lucene.apache.org
 Sent: Tue, January 19, 2010 9:40:07 AM
 Subject: Proximity of More than Single Words?
 
 For proximity expressions, the query parser documentation says to use the 
 tilde, ~, symbol at the end of a Phrase. It gives the example "jakarta apache"~10.
 
 Does this mean that proximity can only be operated on single words enclosed in 
 quotation marks? To clarify the question by comparison: on some systems, the 
 w/ proximity operator lets one search for:
 
 crude w/4 west texas
 
 or
 
 spot prices w/3 gulf coast
 
 The Lucene documentation seems to imply that such searches cannot be 
 constructed in any straightforward way (although there might be a way to get 
 the effect by going around Cobb's Hill). Or does the Lucene syntax allow the 
 examples to be cast as:
 
 crude west texas~4
 
 or
 
 spot prices gulf coast~3
 
 If not, is it a fair assessment to say that in Lucene, proximity is limited to 
 being a part of phrase searching, and its function is exhausted by allowing a 
 slop factor in matching phrases?
 
 Thanks in advance for any help with this.
 
 Thanks in advance for any help with this.
 
 T. R.
 t...@midrivers.com
 http://www.linkedin.com/in/trhalvorson
 www.ncodian.com
 http://twitter.com/trhalvorson 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene as a primary datastore

2010-01-20 Thread Otis Gospodnetic
Guido,

No, you should absolutely not need to constantly rebuild the index.  If you 
find you have to do that, you'll know you are doing something wrong.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Guido Bartolucci guido.bartolu...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Wed, January 20, 2010 4:25:09 PM
 Subject: Re: Lucene as a primary datastore
 
 Thanks for the response. I understand all of what you wrote, but what
 I care about and what I had a little trouble describing exactly in my
 previous question is:
 
 - Are all problems with Lucene obvious (e.g., you get an exception and
 you know your data is now bad) or are there subtle corruptions that
 just happen and because of that it makes sense to constantly rebuild
 the index?
 
 I ask this because if this isn't the case then replication isn't going
 to help, the problems probably get copied over to the other instances
 (unless I'm missing something).
 
 guido.
 
 
 On Wed, Jan 20, 2010 at 11:40 AM, Chris Lu wrote:
  I have 3 concerns of making Lucene as a primary database.
  1) Lucene is stable when it's stable. But you will have java exceptions.
  What would you do when FileNotFoundException or Lucene 2.9.1 'read past
  EOF' IOException under system load happens?
   For me, I don't think the data is safe this way. Or, you can understand all Lucene
   APIs and never make any mistakes.
  Some databases, like some versions of mysql, could corrupt data. No better,
  but it's still more robust.
  2) As the name suggests, Lucene index is just an index, like database index,
  it's an auxiliary data structure. It's only fast in one way, but could be
  slow in other ways.
  3) The more robust approach is to pull data out of database, and create a
  Lucene index. In case something goes wrong, you can always pull data out
  again and create the index again.
 
  --
  Chris Lu
  -
  Instant Scalable Full-Text Search On Any Database/Application
  site: http://www.dbsight.net
  demo: http://search.dbsight.com
  Lucene Database Search in 3 minutes:
  
 http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
  DBSight customer, a shopping comparison site, (anonymous per request) got
  2.6 Million Euro funding!
 
 
 
  Guido Bartolucci wrote:
 
  I know that the primary use case for Lucene is as an index of data
  that can be reconstructed (e.g., from a relational database or from
  spidering your corporate intranet).
 
  But, I'm curious if anyone uses Lucene as their primary datastore for
  their gold data. Is it good enough?
 
  Would anyone consider (or do people already) store data in Lucene
  that, if it was lost, would destroy their business? And no, I'm not
  suggesting that you don't back up this data, I'm just curious if there
  are problems with using Lucene in this way. Are there subtle
  corruptions that might show up in Lucene that wouldn't show up in
  Oracle or MySQL?
 
  I'm considering using Lucene in this way but I haven't been able to
  find any documentation describing this use case. Are there any studies
  of Lucene vs MySQL running for N years comparing the corruptions and
  recovery times?
 
  Am I just ignorant and scared of Lucene and too trusting of Oracle and
  MySQL?
 
  Thanks.
 
  -guido.
 
  (BTW, I did find a similar question asked back in 2007 in the archives
  but it doesn't really answer my question)
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can you boost multiple terms using brackets ?

2010-01-20 Thread Otis Gospodnetic
Yes, I believe it is the same.  I bet the Explain explanation would help 
confirm this.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
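
P.S. One way to confirm it empirically: parse both forms and compare Searcher.explain() output for the same document. A sketch (the searcher, analyzer, and docId are assumed to exist):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.util.Version;

public class BoostCheckSketch {
    static void compare(Searcher searcher, Analyzer analyzer, int docId) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_30, "title", analyzer);
        Query a = parser.parse("title:(return panther)^3 alias:(return panther)");
        Query b = parser.parse("title:return^3 title:panther^3 alias:(return panther)");
        // identical score breakdowns for the same doc means the forms are equivalent
        System.out.println(searcher.explain(a, docId));
        System.out.println(searcher.explain(b, docId));
    }
}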



- Original Message 
 From: Paul Taylor paul_t...@fastmail.fm
 To: java-user@lucene.apache.org
 Sent: Wed, January 20, 2010 1:03:14 PM
 Subject: Can you boost multiple terms using brackets ?
 
 
 Hi
 
 is
 
 title:(return panther)^3 alias:(return panther)
 
 
 the same as
 
 title:return^3 title:panther^3 alias:(return panther)
 
 
 thanks Paul
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene as a primary datastore

2010-01-19 Thread Otis Gospodnetic
You are not alone, Guido.  It's a good question.  In my experience, Lucene is 
as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not 
corrupt it.  Of course, even with the most expensive databases, you'd want to 
make backups.  The same goes with Lucene.  Nowadays, one way people make 
backups is via replication. :)  Solr users thus often get backups for free, 
as do people who put copies of their data on file systems like HDFS, which tend 
to have replication turned on.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Guido Bartolucci guido.bartolu...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Tue, January 19, 2010 10:58:36 PM
 Subject: Lucene as a primary datastore
 
 I know that the primary use case for Lucene is as an index of data
 that can be reconstructed (e.g., from a relational database or from
 spidering your corporate intranet).
 
 But, I'm curious if anyone uses Lucene as their primary datastore for
 their gold data. Is it good enough?
 
 Would anyone consider (or do people already) store data in Lucene
 that, if it was lost, would destroy their business? And no, I'm not
 suggesting that you don't back up this data, I'm just curious if there
 are problems with using Lucene in this way. Are there subtle
 corruptions that might show up in Lucene that wouldn't show up in
 Oracle or MySQL?
 
 I'm considering using Lucene in this way but I haven't been able to
 find any documentation describing this use case. Are there any studies
 of Lucene vs MySQL running for N years comparing the corruptions and
 recovery times?
 
 Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL?
 
 Thanks.
 
 -guido.
 
 (BTW, I did find a similar question asked back in 2007 in the archives
 but it doesn't really answer my question)
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene as a primary datastore

2010-01-19 Thread Otis Gospodnetic
Have you seen the Hot Backups with Lucene paper available via 
http://www.manning.com/hatcher3/ ?

 
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Ganesh emailg...@yahoo.co.in
 To: java-user@lucene.apache.org
 Sent: Wed, January 20, 2010 1:13:21 AM
 Subject: Re: Lucene as a primary datastore
 
 We have data in compound files and we use Lucene as the primary database. It's 
 working great and much faster with millions of records. The only issue I face 
 is with sorting. Lucene sorting consumes a good amount of memory. I don't know 
 much about the MySQL/PostgreSQL databases and how they behave with millions of 
 records, but I guess their sorting memory consumption would be less.  
 
 It would be great if Lucene had the ability to do backups / replication. I 
 don't know how to modify/use the Solr script.  
 
 Regards
 Ganesh
 
 
 - Original Message - 
 From: Otis Gospodnetic 
 To: ; 
 Sent: Wednesday, January 20, 2010 10:45 AM
 Subject: Re: Lucene as a primary datastore
 
 
  You are not alone, Guido.  It's a good question.  In my experience, Lucene 
  is 
 as stable as MySQL/PostgreSQL in terms of its ability to hold your data and 
 not 
 corrupt it.  Of course, even with the most expensive databases, you'd want to 
 make backups.  The same goes with Lucene.  Nowadays, one way people make 
 backups is via replication. :)  Solr users thus often get backups for free, 
 as 
 do people who put copies of their data on file systems like HDFS, which tend 
 to 
 have replication turned on.
  
  Otis
  --
  Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
  
  
  
  - Original Message 
  From: Guido Bartolucci 
  To: java-user@lucene.apache.org
  Sent: Tue, January 19, 2010 10:58:36 PM
  Subject: Lucene as a primary datastore
  
  I know that the primary use case for Lucene is as an index of data
  that can be reconstructed (e.g., from a relational database or from
  spidering your corporate intranet).
  
  But, I'm curious if anyone uses Lucene as their primary datastore for
  their gold data. Is it good enough?
  
  Would anyone consider (or do people already) store data in Lucene
  that, if it was lost, would destroy their business? And no, I'm not
  suggesting that you don't back up this data, I'm just curious if there
  are problems with using Lucene in this way. Are there subtle
  corruptions that might show up in Lucene that wouldn't show up in
  Oracle or MySQL?
  
  I'm considering using Lucene in this way but I haven't been able to
  find any documentation describing this use case. Are there any studies
  of Lucene vs MySQL running for N years comparing the corruptions and
  recovery times?
  
  Am I just ignorant and scared of Lucene and too trusting of Oracle and 
  MySQL?
  
  Thanks.
  
  -guido.
  
  (BTW, I did find a similar question asked back in 2007 in the archives
  but it doesn't really answer my question)
  
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A way to download URLs and index better ?

2010-01-16 Thread Otis Gospodnetic
Hello,

Use Droids, it's much simpler than Nutch or Heritrix:

http://incubator.apache.org/droids/

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Phan The Dai thienthanhom...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Sat, January 16, 2010 2:20:47 AM
 Subject: A way to download URLs and index better ?
 
 Hi everyone, please help me with this question:
 I need to download some webpages from a list of URLs (about 200 links) and
 then index them with Lucene.
 This list is not fixed, because it depends on the definition of my process.
 Currently, in my web application, I wrote a class for downloading, but its
 download time is too long.
 
 Please recommend a Java library suitable for my situation, to optimize
 downloading. Examples would be very welcome. (INPUT: list of URLs; OUTPUT:
 webpage contents, or an indexed repository.)
 Thank you very much.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Otis Gospodnetic
I think Jason meant 15-20GB segments?
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
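
P.S. For reference, a sketch of the knobs being discussed, using the 2.9/3.0-era API (the merge policy constructor changed in later releases), and with Jason's caveat below that optimize() may ignore maxMergeMB:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;

public class PartialOptimizeSketch {
    static void partialOptimize(IndexWriter writer) throws java.io.IOException {
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy(writer);
        mp.setMaxMergeMB(100.0);   // segments above this size are left out of normal merges
        writer.setMergePolicy(mp);
        writer.optimize(20);       // merge down to at most 20 segments instead of 1
    }
}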





From: Jason Rutherglen jason.rutherg...@gmail.com
To: java-user@lucene.apache.org
Sent: Wed, January 13, 2010 5:54:38 PM
Subject: Re: Max Segmentation Size when Optimizing Index

Yes... You could hack LogMergePolicy to do something else.

I use optimise(numsegments:5) regularly on 80GB indexes that, if
optimized to 1 segment, would thrash the IO excessively.  This works
fine because 15-20GB indexes are plenty large and fast.

On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong mrt...@gmail.com wrote:
 Seems like optimize() only cares about final number of segments rather than
 the size of the segment. Is it so?

 On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 There's a different method in LogMergePolicy that performs the
 optimize... Right, so normal merging uses the findMerges method, then
 there's a findMergeOptimize (method names could be inaccurate).

 On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong mrt...@gmail.com
 wrote:
  Do you mean MergePolicy is only used during index time and will be
 ignored
  by by the Optimize() process?
 
 
  On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  Oh ok, you're asking about optimizing... I think that's a different
  algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
  param.
 
  On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong mrt...@gmail.com
 
  wrote:
   Thanks, Jason.
  
   Is my understanding correct that
  LogByteSizeMergePolicy.setMaxMergeMB(100)
   will prevent
   merging of two segments that is larger than 100 Mb each at the
 optimizing
   time?
  
   If so, why do think would I still see segment that is larger than 200
 MB?
  
  
  
   On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen 
   jason.rutherg...@gmail.com wrote:
  
   Hi Trin,
  
   There was recently a discussion about this, the max size is
   for the before merge segments, rather than the resultant merged
    segment (if that makes sense). It'd be great if we had a merge
    policy that limited the resultant merged segment, though that'd
    be a rough approximation at best.
  
   Jason
  
   On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong 
 mrt...@gmail.com
  
   wrote:
Hi,
    
    I am trying to optimize the index, which merges different segments
    together. Let's say the index folder is 1GB in total; I need each
    segment to be no larger than 200MB. I tried to use *LogByteSizeMergePolicy*
    and setMaxMergeMB(100) to ensure no segment after merging would be over
    200MB. However, I still see segments that are larger than 200MB. I did call
    IndexWriter.optimize(20) to make sure there are enough segments
    to allow each segment to be under 200MB.
    
    Can someone let me know if I am using this right? Any suggestion on
    how to tackle this would be helpful.
    
    Thanks,
    
    Trin
   
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
  
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

NYC Search in the Cloud meetup: Jan 20

2010-01-12 Thread Otis Gospodnetic
Hello,

If Search Engine Integration, Deployment and Scaling in the Cloud sounds 
interesting to you, and you are going to be in or near New York next Wednesday 
(Jan 20) evening:

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/

Sorry for dupes to those of you subscribed to multiple @lucene lists.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to follow intranet: configuration in nutch website

2010-01-12 Thread Otis Gospodnetic
Zhou,

Your question will get more attention if you send it to 
nutch-u...@lucene.apache.org list instead.  This list is for Lucene Java.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: jyzhou...@yahoo.com jyzhou...@yahoo.com
 To: java-user@lucene.apache.org
 Sent: Tue, January 12, 2010 10:51:59 PM
 Subject: how to follow intranet: configuration in nutch website
 
 Hi,
 
 I am trying to follow the instructions from 
 http://lucene.apache.org/nutch/tutorial8.html .
 Intranet: Configuration
 To configure things for intranet crawling you must:
 
 1. Create a directory with a flat file of root urls.  For example, to
 crawl the nutch site you might start with a file named
 urls/nutch containing the url of just the Nutch home
 page.  All other Nutch pages should be reachable from this page.  The
 urls/nutch file would thus contain:
 http://lucene.apache.org/nutch/
 
 
 
 I do not understand this. Can anyone help me out?
 
 Thanks.
 zhou
 
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene index file randomly crash and need to reindex

2010-01-12 Thread Otis Gospodnetic
Hi,

Use the latest version of Lucene, obey Lucene's locks, write with 1 
IndexWriter, avoid NFS...

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: zhang99 second_co...@yahoo.com
 To: java-user@lucene.apache.org
 Sent: Tue, January 12, 2010 10:41:19 PM
 Subject: lucene index file randomly crash and need to reindex
 
 
 How do you all deal with the issue of occasionally needing to reindex? What
 recommendations do you suggest to minimize this?
 -- 
 View this message in context: 
 http://old.nabble.com/lucene-index-file-randomly-crash-and-need-to-reindex-tp27139147p27139147.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: a complete solution for building a website search with lucene

2010-01-08 Thread Otis Gospodnetic
Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs 
that the JVM supports.
But it does contain some shell scripts, as does Hadoop that Nutch uses.  Oh, I 
guess Windows people run it under Cygwin?
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: jyzhou...@yahoo.com jyzhou...@yahoo.com
 To: java-user@lucene.apache.org
 Sent: Fri, January 8, 2010 5:03:41 AM
 Subject: Re: a complete solution for building a website search with lucene
 
 Hi Paul,
 
 Thanks. 
 So: use Nutch to do the crawling, and integrate Lucene into the web 
 application so that it can do search online.
 
 BTW, Nutch seems to have only a Linux version, while my development is on 
 Windows. Am I right?
 
 Zhou
 
 --- On Fri, 8/1/10, Paul Libbrecht wrote:
 
 From: Paul Libbrecht 
 Subject: Re: a complete solution for building a website search with lucene
 To: java-user@lucene.apache.org
 Date: Friday, 8 January, 2010, 4:27 PM
 
 Zhou,
 
 Lucene is a back-end library; it's very useful for developers, but it is not a 
 complete site-search engine.
 A Lucene-based site-search engine is Nutch; it does crawl.
 Solr also provides functions close to these, with a lot of thought given to 
 flexible integration; its acquisition methods are rather based on feeds or 
 other mechanisms (see the DIH, for example).
 acquisition methods (see DIH for example).
 
 paul
 
 
 
 
 On 08-Jan-10 at 08:08, wrote:
 
  Hi,
  
  I am new to Lucene.
  
  To build a web search function, it needs to have a backend indexing 
  function. But, before that, should one run a Crawler? Because Lucene 
  indexes are based on HTML documents, while a Crawler can turn the website 
  pages into HTML documents. Am I right?
  
  If so, can anyone suggest a Crawler to me? Like Nutch?
  Thanks
  Zhou
  
  
  
  
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Otis Gospodnetic
Yuliya,

The index *directory* will be larger *while* you are optimizing.  After the 
optimization is completed successfully, the index directory will be smaller.  
It is possible that your index directory is large(r) because you have some 
left-over segments (e.g. from some earlier failed/interrupted optimizations) 
that are not really a part of the index.  After optimizing, you should have 
only 1 segment, so if you see more than 1 segment, look at the ones with older 
timestamps.  Those can be (re)moved.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Yuliya Palchaninava y...@solute.de
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Thu, January 7, 2010 11:23:08 AM
 Subject: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not 
 optimized index
 
 Hi,
 
 According to the api documentation: In general, once the optimize completes, 
 the total size of the index will be less than the size of the starting index. 
 It 
 could be quite a bit smaller (if there were many pending deletes) or just 
 slightly smaller. In our case the index becomes not smaller but larger, 
 namely 
 thrice as large. 
 
 The not-optimized index doesn't contain compressed fields, which could have 
 caused the growth of the index due to the optimization. So we cannot explain 
 what happens.
 
 Does someone have an explanation for the index growth due to the optimization?
 
 Thanks,
 Yuliya
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance Results on changing the way fields are stored

2010-01-07 Thread Otis Gospodnetic
You could try Avro instead of JSON/XML/Java Serialization.  It's compact (and 
new).

http://hadoop.apache.org/avro/

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Paul Taylor paul_t...@fastmail.fm
 To: java-user@lucene.apache.org
 Sent: Tue, January 5, 2010 7:44:21 AM
 Subject: Performance Results on changing the way fields are stored
 
 So currently in my index I index and store a number of small fields. I need 
 both so I can search on the fields; then I use the stored versions to generate 
 the output document (which is either an XML or JSON representation). Because I 
 read that stored and indexed fields are dealt with completely separately, I 
 tried another tack: storing only one field, which was a serialized version of 
 the output document. This solves a couple of issues I was having, but I was 
 disappointed that both the size of the index and the index build time 
 increased. I thought that if all the stored data was held in one field, the 
 resultant index would be smaller, and I didn't expect index time to increase 
 by as much as it did. I was also surprised that Java serialization was slower 
 and used more space than both JSON and XML serialization.
 
 Results as follows (Type : Time : Index Size):
 
 Only indexed, no norms                                           : 105 : 38 MB
 Only indexed                                                     : 111 : 43 MB
 Same fields written as Indexed and Stored (current situation)    : 115 : 83 MB
 Fields indexed, one JAXB class stored using JSON marshalling     : 140 : 115 MB
 Fields indexed, one JAXB class stored using XML marshalling      : 189 : 198 MB
 Fields indexed, one JAXB class stored using Java serialization   : 305 : 485 MB
 
 Are these results to be expected? Could anybody suggest anything else I could 
 do?
 
 
 Paul
 
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Otis Gospodnetic
Maybe you can paste a directory listing before optimization and after 
optimization?

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Yuliya Palchaninava y...@solute.de
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Thu, January 7, 2010 11:50:29 AM
 Subject: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the 
 not optimized index
 
 Otis,
 
 thanks for the answer. 
 
 Unfortunatelly the index *directory* remains larger *after the optimization.
 In our case the otimization was/is completed successfully and, as you say,
 there is only one segment in the directory.
 
 Some other ideas?
 
 Thanks,
 Yuliya
 
  -Original Message-
  From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
  Sent: Thursday, January 7, 2010 17:35
  To: java-user@lucene.apache.org
  Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as 
  large as the not optimized index
  
  Yuliya,
  
  The index *directory* will be larger *while* you are 
  optimizing.  After the optimization is completed 
  successfully, the index directory will be smaller.  It is 
  possible that your index directory is large(r) because you 
  have some left-over segments (e.g. from some earlier 
  failed/interrupted optimizations) that are not really a part 
  of the index.  After optimizing, you should have only 1 
  segment, so if you see more than 1 segment, look at the ones 
  with older timestamps.  Those can be (re)moved.
  
   Otis
  --
  Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
  
  
  
  - Original Message 
   From: Yuliya Palchaninava 
   To: java-user@lucene.apache.org 
   Sent: Thu, January 7, 2010 11:23:08 AM
   Subject: Lucene 2.9 and 3.0: Optimized index is thrice as 
  large as the 
   not optimized index
   
   Hi,
   
   According to the api documentation: In general, once the optimize 
   completes, the total size of the index will be less than 
  the size of 
   the starting index. It could be quite a bit smaller (if there were 
   many pending deletes) or just slightly smaller. In our 
  case the index 
   becomes not smaller but larger, namely thrice as large.
   
   The not optimized index doesn't contain compressed fields, 
  what could 
   have caused the growth of the index due to the otimization. So we 
   cannot explain what happens.
   
   Does someone have an explanation for the index growth due 
  to the optimization?
   
   Thanks,
   Yuliya
   
   
   
  -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to limit the size of an index?

2010-01-07 Thread Otis Gospodnetic
 Merge factor controls how many segments are merged at once.  The default is 
 10.
 
 The maxMergeMB setting sets the max size for a given segment to be
 included in a merge.

I wonder if renaming that to maxSegSizeMergeMB would make it more obvious what 
this does?

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

 Roughly, the upper bound on merged segments is the sum of their sizes.
 
 So the rough upper bound on any segment's size is mergeFactor * maxMergeMB.
 
 Mike
 
 On Thu, Jan 7, 2010 at 11:04 AM, Dvora wrote:
 
  Can you explain how the combination of merge factor and max merge size
  control the size of files?
 
  For example, if one would like to limit the files size to 3,4 or 7MB - how
  these parameters values can be predicted?
 
 
 
  Michael McCandless-2 wrote:
 
 
  This tells the IndexWriter NOT to merge any segment that's over 1.0 MB
  in size.  With a default merge factor of 10, this should generally
  mean you don't get a segment over 10MB, though it may not be a hard
  guarantee (you can lower the 1.0 if you still see a segment over 10
  MB).
 
 
 
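To make the mergeFactor / maxMergeMB interplay concrete, a minimal sketch, assuming a Lucene 2.9/3.0-style API and an already-opened writer; the 1.0 MB cap and mergeFactor 10 are the illustrative values from this thread, giving merged segments of roughly 10 MB at most:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.MergePolicy;

IndexWriter writer = ...;           // opened on your Directory
writer.setMergeFactor(10);          // how many segments are merged at once
MergePolicy mp = writer.getMergePolicy();
if (mp instanceof LogByteSizeMergePolicy) {
    // segments larger than this are never picked as merge *inputs*,
    // so merged outputs stay around mergeFactor * maxMergeMB
    ((LogByteSizeMergePolicy) mp).setMaxMergeMB(1.0);
}

As Mike notes, this bounds the inputs rather than the output, so treat the 10 MB figure as approximate.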



Re: Is there a way to limit the size of an index?

2010-01-07 Thread Otis Gospodnetic
Sure, sounds good, maybe even drop the "ing".

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: java-user@lucene.apache.org
 Sent: Thu, January 7, 2010 2:28:15 PM
 Subject: Re: Is there a way to limit the size of an index?
 
 On Thu, Jan 7, 2010 at 2:23 PM, Otis Gospodnetic
 wrote:
  Merge factor controls how many segments are merged at once.  The default 
  is 
 10.
 
  The maxMergeMB setting sets the max size for a given segment to be
  included in a merge.
 
  I wonder if renaming that to maxSegSizeMergeMB would make it more obvious 
  what 
 this does?
 
 Well... that setting is already in LogByteSizeMergePolicy (not
 IndexWriter), so I think in that context it's pretty clear?
 
 Though I'd love to find a better name that conveys that the size
 limitation applies to the segments *being* merged, not to the
 resulting merged segment.  maxStartingSegSizeMB?
 
 Mike
 



Re: Implementing filtering based on multiple fields

2010-01-07 Thread Otis Gospodnetic
For something like CSE, I think you want to isolate users and their 
data/indices.

I'd look at Bixo or Nutch or Droids ==> Lucene or Solr

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Yaniv Ben Yosef yani...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, January 7, 2010 3:54:22 PM
 Subject: Implementing filtering based on multiple fields
 
 Hi,
 
 I'm very new to Lucene. In fact, I'm at the beginning of an evaluation
 phase, trying to figure whether Lucene is the right fit for my needs.
 The project I'm involved in requires something similar to the Google Custom
 Search Engine (CSE). In CSE, each user can
 define a set (could be a large set) of websites, and limit the search to
 only those websites. So for example, I can create a CSE that searches all
 web pages on cnn.com, msnbc.com and nytimes.com only.
 I am trying to understand whether and how I can do something similar in
 Lucene.
 
 The FAQ hints about this possibility
 here,
 but it mentions a class that no longer exists in 3.0 (QueryFilter), and is
 very laconic about the suggested options. Also I'm not sure how well it will
 perform in my use case (or even if it fits at all).
 I thought about creating a separate index for each user or CSE. However, my
 system should be able to handle tens of thousands of concurrent users. I
 haven't done any analysis yet on how this will affect CPU, RAM, I/O and
 storage size, but was wondering if any of you experienced Lucene
 users/developers think it's a good direction.
 If that's not a good idea, what would be a good strategy here?
 
 Any help will be much appreciated,
 Yaniv





Re: Implementing filtering based on multiple fields

2010-01-07 Thread Otis Gospodnetic
Ah, well, masking it didn't help.  Yes, ignore Bixo, Nutch, and Droids then.
Consider DataImportHandler from Solr or wait a bit for Lucene Connectors 
Framework to materialize.  Or use LuSql, or DbSight, or Sematext's Database 
Indexer.

Yes, I was suggesting a separate index for each user.  That's what Simpy uses:
it has some 200K indices on 1 box and, if I remember correctly, handles dozens
of QPS without any caching.  Load is under 1.0.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
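
If you do go with one shared index instead, note that the QueryFilter mentioned in the FAQ became QueryWrapperFilter in 2.9/3.0. A hedged sketch, where the "site" field name and values are illustrative assumptions and userQuery/searcher are your own objects:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

BooleanQuery sites = new BooleanQuery();
sites.add(new TermQuery(new Term("site", "cnn.com")), BooleanClause.Occur.SHOULD);
sites.add(new TermQuery(new Term("site", "nytimes.com")), BooleanClause.Occur.SHOULD);

// restrict the user's query to documents from those sites only
Filter siteFilter = new CachingWrapperFilter(new QueryWrapperFilter(sites));
TopDocs hits = searcher.search(userQuery, siteFilter, 10);

CachingWrapperFilter pays off when the same user's site set is reused across searches.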



- Original Message 
 From: Yaniv Ben Yosef yani...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, January 7, 2010 6:55:18 PM
 Subject: Re: Implementing filtering based on multiple fields
 
 Thanks Otis.
 
 If I understand correctly - Bixo, Nutch and Droids are technologies to use
 for crawling the web and building an index. My project is actually about
 indexing a large database, where you can think of every row as a web page,
 and a particular column is the equivalent of a web site. (I didn't mention
 that in the previous post because I didn't want to complicate my question,
 and it seems equivalent to Google CSE given that Lucene can use virtually
 any input for indexing, AFAIK)
 Therefore I'm not sure if the frameworks you've mentioned are applicable to
 my project as they seem to be related to web page indexing, but perhaps I'm
 missing something.
 Also, what did you mean about isolating users and their data/indices. Did
 you mean that I should create a separate index per user?
 
 Thanks again!
 
 On Fri, Jan 8, 2010 at 12:35 AM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:
 
  For something like CSE, I think you want to isolate users and their
  data/indices.
 
  I'd look at Bixo or Nutch or Droids ==> Lucene or Solr
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
 
 
 
  - Original Message 
   From: Yaniv Ben Yosef 
   To: java-user@lucene.apache.org
   Sent: Thu, January 7, 2010 3:54:22 PM
   Subject: Implementing filtering based on multiple fields
  
   Hi,
  
   I'm very new to Lucene. In fact, I'm at the beginning of an evaluation
   phase, trying to figure whether Lucene is the right fit for my needs.
   The project I'm involved in requires something similar to the Google
  Custom
   Search Engine (CSE). In CSE, each user can
   define a set (could be a large set) of websites, and limit the search to
   only those websites. So for example, I can create a CSE that searches all
   web pages on cnn.com, msnbc.com and nytimes.com only.
   I am trying to understand whether and how I can do something similar in
   Lucene.
  
   The FAQ hints about this possibility
   here,
   but it mentions a class that no longer exists in 3.0 (QueryFilter), and
  is
   very laconic about the suggested options. Also I'm not sure how well it
  will
   perform in my use case (or even if it fits at all).
   I thought about creating a separate index for each user or CSE. However,
  my
   system should be able to handle tens of thousands of concurrent users. I
   haven't done any analysis yet on how this will affect CPU, RAM, I/O and
   storage size, but was wondering if any of you experienced Lucene
   users/developers think it's a good direction.
   If that's not a good idea, what would be a good strategy here?
  
   Any help will be much appreciated,
   Yaniv
 
 



Re: NGramTokenizer stops working after about 1000 terms

2010-01-03 Thread Otis Gospodnetic
This actually rings a bell for me... have a look at Lucene's JIRA, I think this 
was reported as a bug once and perhaps has been fixed.


Note that Lucene in Action 2 has a case study that talks about searching source 
code.  You may find that study interesting.
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
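
Until the JIRA search turns something up, a hedged workaround sketch: if the tokenizer's input-buffer limit is indeed the culprit, applying NGramTokenFilter (from contrib analyzers) on top of a WhitespaceTokenizer avoids pulling the whole input into a single tokenizer buffer, at the cost of not producing n-grams that span whitespace:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public static class NGramFilterAnalyzer5 extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // n-grams are generated per whitespace-separated token
        return new NGramTokenFilter(new WhitespaceTokenizer(reader), 5, 5);
    }
}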



- Original Message 
 From: Stefan Trcek wzzelfz...@abas.de
 To: java-user@lucene.apache.org
 Sent: Mon, December 14, 2009 9:39:34 AM
 Subject: NGramTokenizer stops working after about 1000 terms
 
 Hello
 
 For a source code (git repo) search engine I chose to use an ngram 
 analyzer for substring search (something like git blame).
 
 This worked fine except it didn't find some strings. I tracked it down 
 to the analyzer. Once the ngram analyzer had yielded about 1000 terms it 
 stopped yielding more terms; it seems to be at most (1024 - ngram_length) 
 terms. When I use StandardAnalyzer it works as expected.
 Is this a bug or did I miss a limit?
 
 Tested with lucene-2.9.1 and 3.0, this is the core routine I use:
 
 public static class NGramAnalyzer5 extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
         return new NGramTokenizer(reader, 5, 5);
     }
 }
 
 public static String[] analyzeString(Analyzer analyzer,
         String fieldName, String string) throws IOException {
     List<String> output = new ArrayList<String>();
     TokenStream tokenStream = analyzer.tokenStream(fieldName,
         new StringReader(string));
     TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(
         TermAttribute.class);
     tokenStream.reset();
     while (tokenStream.incrementToken()) {
         output.add(termAtt.term());
     }
     tokenStream.end();
     tokenStream.close();
     return output.toArray(new String[0]);
 }
 
 The complete example is attached. in.txt must be in . and is plain 
 ASCII.
 
 Stefan
 



Re: Snowball Stemmer Question

2009-12-03 Thread Otis Gospodnetic
Chris,

You could look at KStem to see if that does a better job.
Or perhaps WordNet can be used to get the lemma of those terms instead of using 
stemming.
Finally what was I going to say... ah, yes, using synonyms may be another 
way this can be handled.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
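
A hedged sketch of the synonym route: a tiny TokenFilter that rewrites known irregular variants to a canonical form at both index and query time. The mapping entries are just the examples from this thread:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class CanonicalFormFilter extends TokenFilter {
    private static final Map<String, String> CANON = new HashMap<String, String>();
    static {
        CANON.put("colossal", "colossus");
        CANON.put("hippocampal", "hippocampus");
    }
    private final TermAttribute termAtt;

    public CanonicalFormFilter(TokenStream in) {
        super(in);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String mapped = CANON.get(termAtt.term());
        if (mapped != null) termAtt.setTermBuffer(mapped);  // rewrite in place
        return true;
    }
}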



- Original Message 
 From: Christopher Condit con...@sdsc.edu
 To: java-user@lucene.apache.org java-user@lucene.apache.org
 Sent: Thu, December 3, 2009 3:04:03 PM
 Subject: Snowball Stemmer Question
 
 The Snowball Analyzer works well for certain constructs but not others. In 
 particular I'm having a problem with things like colossal vs colossus and 
 hippocampus vs hippocampal.
 Is there a way to customize the analyzer to include these rules?
 Thanks,
 -Chris
 



Re: Getting score of explicit documents for a query

2009-12-03 Thread Otis Gospodnetic
I think you should be able to use one or more FilteredQuery instances (with the 
IDs of your docs) with your main query and thus get the scores only for the docs 
that interest you.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
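
A minimal sketch of that idea, assuming your external IDs live in an indexed field named "id" (the field name and values are illustrative, and mainQuery/searcher are your own objects):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

BooleanQuery ids = new BooleanQuery();
ids.add(new TermQuery(new Term("id", "doc42")), BooleanClause.Occur.SHOULD);
ids.add(new TermQuery(new Term("id", "doc97")), BooleanClause.Occur.SHOULD);

// scores come from mainQuery; the filter only restricts which docs qualify
Query scored = new FilteredQuery(mainQuery, new QueryWrapperFilter(ids));
TopDocs top = searcher.search(scored, 10);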



- Original Message 
 From: Erdinc Yilmazel erd...@yilmazel.com
 To: java-user@lucene.apache.org
 Sent: Thu, December 3, 2009 11:37:08 AM
 Subject: Getting score of explicit documents for a query
 
 Hi,
 
 Given a query, is there a way to learn score of some specific documents in
 the index against this query? I don't want to make a global search in the
 index and rank and sort all the matching documents. What I want to do is
 learn the rank of a bunch of documents in the index that I can identify by
 document id..
 
 Erdinc





NYC Search & Discovery Meetup

2009-12-01 Thread Otis Gospodnetic
Hello,

For those living in or near NYC, you may be interested in joining (and/or 
presenting?) at the NYC Search & Discovery Meetup.
Topics are: search, machine learning, data mining, NLP, information gathering, 
information extraction, etc.

  http://www.meetup.com/NYC-Search-and-Discovery/

Our previous/first meetup was about solr-python and parse.ly (a service that 
makes use of Solr and solr-python).

Tomorrow (December 2 2009) we have:

  Incorporating Probabilistic Retrieval Knowledge into TFIDF-based Search Engine

You can RSVP at:
  http://www.meetup.com/NYC-Search-and-Discovery/calendar/11745435/

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch





Re: Need help regarding implementation of autosuggest using jquery

2009-12-01 Thread Otis Gospodnetic
Hi,

Have a look at http://www.sematext.com/products/autocomplete/index.html

It handles Chinese and large volumes of data.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: fulin tang tangfu...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Thu, November 26, 2009 9:10:41 PM
 Subject: Re: Need help regarding implementation of autosuggest using jquery
 
 By the way, we search Chinese words, so a Trie tree doesn't look perfect
 for us either.
 
 
 2009/11/27 fulin tang :
  We have the same needs in our music search, and we found this is not a
  good approach for performance reasons.
 
  Does anyone have experience implementing autosuggestion in a heavy
  production environment?
  Any suggestions?
 
 
  2009/11/26 Anshum :
  Try this,
  Change the code as required:
  -
 
 
  import java.io.IOException;
 
  import org.apache.lucene.index.CorruptIndexException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;
 
  /**
   * @author anshum
   *
   */
  public class GetTermsToSuggest {
 
      private static void getTerms(String inputText) {
          IndexReader reader = null;
          try {
              reader = IndexReader.open("/home/anshum/index/testindex");
              String field = "fieldname";
              field = field.intern();
              TermEnum tenum = reader.terms(new Term("fieldname", ""));
              Boolean hasRun = false;
              try {
                  do {
                      final Term term = tenum.term();
                      if (term == null || term.field() != field)
                          break;
                      final String termText = term.text();
                      if (termText.startsWith(inputText)) {
                          System.out.println(termText);
                          hasRun = true;
                      } else if (hasRun == true)
                          break;
                  } while (tenum.next());
                  tenum.close();
              } catch (IOException e) {
                  e.printStackTrace();
              }
          } catch (CorruptIndexException e2) {
              e2.printStackTrace();
          } catch (IOException e2) {
              e2.printStackTrace();
          }
      }
 
      /**
       * @param args
       */
      public static void main(String[] args) {
          GetTermsToSuggest.getTerms(args[0]);
      }
  }
 
 
  --
  Anshum Gupta
  Naukri Labs!
  http://ai-cafe.blogspot.com
 
  The facts expressed here belong to everybody, the opinions to me. The
  distinction is yours to draw
 
 
  On Thu, Nov 26, 2009 at 3:19 PM, Uwe Schindler wrote:
 
  You can fix this if you just create the initial term not with "", but
  with your prefix:
  TermEnum tenum = reader.terms(new Term(field, prefix));
 
  And inside the while loop just break out,
 
  if (!termText.startsWith(prefix)) break;
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
   -Original Message-
   From: DHIVYA M [mailto:dhivyakrishna...@yahoo.com]
   Sent: Thursday, November 26, 2009 10:39 AM
   To: java-user@lucene.apache.org
   Subject: RE: Need help regarding implementation of autosuggest using
   jquery
  
   Sir,
  
   Your suggestion was fantastic.
  
   I tried the below-mentioned code but it is showing me the entire set of
   indexed words starting from the letter that I give as input.
   Ex:
   if I give "fo",
   I am getting all the terms from the words starting with "fo" up to words
   starting with "z",
   i.e. it starts displaying from the word matching the search word and ends
   up with the last word available in the index file.
  
   Kindly suggest me a solution for this problem
  
   Thanks in advance,
   Dhivya
  
   --- On Wed, 25/11/09, Uwe Schindler wrote:
  
  
   From: Uwe Schindler 
   Subject: RE: Need help regarding implementation of autosuggest using
   jquery
   To: java-user@lucene.apache.org
   Date: Wednesday, 25 November, 2009, 9:54 AM
  
  
   Hi Dhivya,
  
   you can iterate all terms in the index using a TermEnum, that can be
   retrieved using IndexReader.terms(Term startTerm).
  
   If you are interested in all terms from a specific field, position the
   TermEnum on the first possible term in this field ("") and iterate until
   the field name changes. As terms in the TermEnum are first ordered by
   field name, then by term text (in UTF-16 order), the loop would look
   like this:
  
   IndexReader reader = ...;
   String field = ...; // the field you are interested in
   field = field.intern(); // important for the while loop
   TermEnum tenum = reader.terms(new Term(field, ""));
   try {
     do {
       final Term term = tenum.term();
       if (term == null || term.field() != field) break;
       final String termText = term.text();
       // do something with the termText
     } while (tenum.next());
   } finally {
     tenum.close();
   }
  
  
   -
   Uwe Schindler
   H.-H.-Meier-Allee 63, D-28213 Bremen
   http://www.thetaphi.de
   eMail: u...@thetaphi.de
  
  
-Original Message-
From: DHIVYA M [mailto:dhivyakrishna...@yahoo.com]
Sent: Wednesday, November 25, 2009 8:06 AM
To: java user
Subject: Need help regarding implementation of autosuggest using 
jquery
   
Hi all,
   
Am using lucene 

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-24 Thread Otis Gospodnetic
For what it's worth, AOL uses a Solr cluster to handle searches for @aol users. 
 Each user has his own index.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: fulin tang tangfu...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Mon, November 23, 2009 9:35:57 PM
 Subject: Is Lucene a good choice for PB scale mailbox search?
 
 We are going to add full-text search to our mailbox service.
 
 The problem is we have more than 1 PB of mail there, and obviously we
 don't want to add another PB of storage for the search service, so we hope
 the index data will be small enough to store while the search stays
 fast.
 
 The lucky part is that every user only searches their own mail, so
 we can split the data into a lot of indexes instead of keeping it in
 one big index.
 
 So, after all these concerns, the question is: is Lucene a good
 choice for this? Or what is the right way to do this? Has anyone
 done this before?
 
 All opinions and comments are welcome !
 
 fulin
 
 
 -- 
 梦的开始挣扎于城市的边缘
 心的远方执着在脚步的瞬间
 我的宿命埋藏了寂寞的永远
 



Re: lucene not returning correct results even though search query is present

2009-11-18 Thread Otis Gospodnetic
Hi,


Please use java-user list for user questions.

Are you sure the file got fully indexed in the first place?  Use Luke to check.

Also, see:
IndexWriter.MaxFieldLength

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
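
A hedged sketch of the MaxFieldLength angle: by default IndexWriter stops indexing a field after 10,000 terms (MaxFieldLength.LIMITED), which matches the "nothing found past a certain line" symptom. Assuming idx is your Directory and matching the snippet below otherwise:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter(idx, new StandardAnalyzer(),
    true, IndexWriter.MaxFieldLength.UNLIMITED);  // lift the 10,000-term cap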



- Original Message 
 From: udayKIRAN udayacc2...@yahoo.com
 To: java-...@lucene.apache.org
 Sent: Thu, November 19, 2009 12:08:32 AM
 Subject: lucene not returning correct results even though search query is 
 present
 
 
 hi,
 I am using Lucene to search log files, but I am not able to search for any
 words in the file that come after a certain line. I am using a FileReader
 to index. Lucene is searching only up to a certain line in the file. Can
 anyone help me?
 These are a few lines of my code:
 
 IndexWriter writer =
     new IndexWriter(idx, new StandardAnalyzer(), true);
 writer.addDocument(createDocument(filename,
     new FileReader(new File(filepath))));
 writer.optimize();
 writer.close();
 
 public static Document createDocument(String folderpath, FileReader fr) {
     Document doc = new Document();
     doc.add(new Field("title", folderpath, Field.Store.YES,
         Field.Index.TOKENIZED));
     doc.add(new Field("content", fr));
     return doc;
 }
 
 // search function
 public static void search(Searcher searcher, String queryString)
         throws ParseException, IOException {
     Query query = new QueryParser("content", new
         StandardAnalyzer()).parse(queryString);
     // Search for the query
     Hits hits = searcher.search(query);
     TopDocs topdocs = searcher.search(query, 1);
     // Examine the Hits object to see if there were any matches
     int hitCount = hits.length();
     if (hitCount == 0) {
         System.out.println(
             "No matches were found for \"" + queryString + "\"");
     } else {
         System.out.println("Hits for \"" +
             queryString + "\" were found in files by:");
         for (int i = 0; i < hitCount; i++) {
             Document doc = hits.doc(i);
             System.out.println("  " + (i + 1) + ". " +
                 doc.get("title"));
         }
     }
     System.out.println();
 }



Re: Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Otis Gospodnetic
Hello,

Most likely due to the operating system caching the relevant portions of the 
index after the first set of queries.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
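
A hedged warm-up sketch: issue a few representative searches at startup so the OS cache is primed before real traffic arrives; the field name and terms are placeholders, and searcher is your own IndexSearcher:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

String[] warmupTerms = {"lucene", "search", "cache"};  // illustrative
for (String t : warmupTerms) {
    searcher.search(new TermQuery(new Term("content", t)), 10);  // discard results
}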



- Original Message 
 From: Dinh pcd...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Tue, November 17, 2009 12:39:14 PM
 Subject: Why Lucene takes longer time for the first query and less for  
 subsequent ones
 
 Hi all,
 
 I made a list of 4 simple, single-term queries and ran 4 searches via Lucene,
 and found that the first time a term is searched, Lucene takes
 quite a bit of time to handle it.
 
 - Query A
 00:27:28,781  INFO LuceneSearchService:151 - Internal search took
 328.21463ms
 00:27:28,781  INFO SearchController:86 - Page rendered in 338.29553ms
 
 - Query B
 00:27:39,171  INFO LuceneSearchService:151 - Internal search took
 480.30908ms
 00:27:39,187  INFO SearchController:86 - Page rendered in 493.07327ms
 
 - Query C
 00:27:46,765  INFO LuceneSearchService:151 - Internal search took
 189.33635ms
 00:27:46,765  INFO SearchController:86 - Page rendered in 195.43823ms
 
 - Query D
 00:28:00,312  INFO LuceneSearchService:151 - Internal search took 330.3596ms
 00:28:00,328  INFO SearchController:86 - Page rendered in 347.34747ms
 
 
 It looks bad at first glance because I have only 500,000 indexed
 documents. However, when I searched them again I found that Lucene ran much
 faster.
 
 - Query A
 00:28:04,046  INFO LuceneSearchService:151 - Internal search took 3.90301ms
 00:28:04,062  INFO SearchController:86 - Page rendered in 15.694173ms
 
 - Query C
 00:28:15,390  INFO LuceneSearchService:151 - Internal search took 1.425879ms
 00:28:15,390  INFO SearchController:86 - Page rendered in 7.946541ms
 
 - Query D
 00:28:26,031  INFO LuceneSearchService:151 - Internal search took 1.849956ms
 00:28:26,046  INFO SearchController:86 - Page rendered in 12.023037ms
 
 - Query B
 00:28:31,609  INFO LuceneSearchService:151 - Internal search took 1.668648ms
 00:28:31,625  INFO SearchController:86 - Page rendered in 15.57237ms
 
 Why does this happen? Does it mean that Lucene has an internal cache engine,
 just like the MySQL query result cache or the Oracle query execution plan cache?
 
 Thanks
 
 Dinh





Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Otis Gospodnetic
Well, I think some people will be for hiding complexity, while others will be 
for being in control and having transparency.  Think how surprised one would be 
to find an extra field in their index, say when looking at it with Luke. :)
 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Glen Newton glen.new...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Tue, November 17, 2009 10:53:01 PM
 Subject: Re: Lucene Java 3.0.0 RC1 now available for testing
 
 I understand the reasons, but - if I may ask so late in the game - was
 this the best way to do this?
 
 From a user (developer) perspective, this is an implementation issue.
 Couldn't this have been done behind the scenes, so that when I asked
  for Field.Index.ANALYZED & Field.Store.COMPRESS, instead of what
 previously happened (and was variously problematic), two fields were
 transparently created, one being binary compressed stored and the
 other being indexed only? The Field API could hide all of this
 complexity, using one underlying Field when I use Field.getString()
 (compressed stored one), using the other when I use Field.setBoost()
 (the indexed one) and both when I call Field.setValue(). This might
 have less impact on developers and be less disruptive on API changes.
 Oh, some naming convention could handle the underlying Fields.
 
 A little complicated I agree.
 
 Again, apologies to those who worked hard on these changes: my fault
 for not noticing this sooner (I hadn't started moving my code to 2.9
 from 2.4 so I hadn't read the deprecation signs).
 
 thanks,
 
 Glen
 
 
 
 2009/11/17 Mark Miller :
  Here is some of the history:
 
  https://issues.apache.org/jira/browse/LUCENE-652
  https://issues.apache.org/jira/browse/LUCENE-1960
 
  Glen Newton wrote:
  Could someone send me where the rationale for the removal of
  COMPRESSED fields is? I've looked at
  
 http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior
  but it is a little light on the 'why' of this change.
 
  My fault - of course - for not paying attention.
 
  thanks,
  Glen
 
  2009/11/17 Uwe Schindler :
 
  Hello Lucene users,
 
 
 
  On behalf of the Lucene dev community (a growing community far larger than
  just the committers) I would like to announce the first release candidate
  for Lucene Java 3.0.
 
 
 
  Please download and check it out - take it for a spin and kick the tires. 
  If
  all goes well, we hope to release the final version of Lucene 3.0 in a
  little over a week.
 
 
 
  The new version is mostly a cleanup release without any new features. All
  deprecations targeted to be removed in version 3.0 were removed. If you 
  are
  upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
  warnings in your code base to be able to recompile against this version.
 
 
 
  This is the first Lucene release with Java 5 as a minimum requirement. The
  API was cleaned up to make use of Java 5's generics, varargs, enums, and
  autoboxing. New users of Lucene are advised to use this version for new
  developments, because it has a clean, type safe new API. Upgrading users 
  can
  now remove unnecessary casts and add generics to their code, too. If you
  have not upgraded your installation to Java 5, please read the file
  JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene
  3.0, it will also happen with any previous release when you upgrade your
  Java environment).
 
 
 
  Lucene 3.0 has some changes regarding compressed fields: 2.9 already
  deprecated compressed fields; support for them was removed now. Lucene 3.0
  is still able to read indexes with compressed fields, but as soon as 
  merges
  occur or the index is optimized, all compressed fields are decompressed 
  and
  converted to Field.Store.YES. Because of this, indexes with compressed
  fields can suddenly get larger.
 
 
 
  While we generally try and maintain full backwards compatibility between
  major versions, Lucene 3.0 has some minor breaks, mostly related to
  deprecation removal, pointed out in the 'Changes in backwards 
  compatibility
  policy' section of CHANGES.txt. Notable are:
 
 
 
  - IndexReader.open(Directory) now opens in read-only mode per default 
  (this
  method was deprecated because of that in 2.9). The same occurs to
  IndexSearcher.
 
  - Already started in 2.9, core TokenStreams are now made final to enforce
  the decorator pattern.
 
  - If you interrupt an IndexWriter merge thread, IndexWriter now throws an
  unchecked ThreadInterruptedException that extends RuntimeException and
  clears the interrupt status.
 
 
 
  Also, remember that this is a release candidate, and not the final Lucene
  3.0 release.
 
 
 
  You can find the full list of changes here:
 
 
 
  HTML version:
 
  

Re: OutofMemory in large index

2009-11-13 Thread Otis Gospodnetic
Hello,

 
Comments inlined.


- Original Message 
 From: vsevel v.se...@lombardodier.com
 To: java-user@lucene.apache.org
 Sent: Fri, November 13, 2009 11:32:02 AM
 Subject: Re: OutofMemory in large index
 
 
 Hi, I am jumping into the thread because I have got a similar issue.
 My index is 30Gb large and contains 21M docs.
 I was able to stay with 1Gb of RAM on the server for a while. Recently I

Is that 1GB heap or 1GB RAM?

 started to simulate parallel searches. Just 2 parallel searches would get
 the server to crash with out of memory errors. I upgraded the server to 3Gb
 of RAM and I was able to run happily 10 parallel full text searches on my
 documents.
 My questions:
 - is 3Gb a relatively normal amount of memory for a server doing lucene
 searches?

These days 3GB of RAM is very little even for a laptop. :)

 - when is that going to stop? I am planning to have at least 40M docs in my
 index. will I need to go from 2.5 to 5Gb of RAM? what about 60M docs? what
 about 20 concurrent searches?

The more you hit the machine, the more resources it needs.  The more resource 
intensive the queries (e.g. sorting?  fuzzy?  wildcard?), the more resources 
they'll need.

One instance of Lucene/Solr I looked at today has an index with 5M not very 
large documents, but high query rates and relatively expensive queries hitting 
a 20GB index.  Each of 10 servers has 8 cores that were only about 30% idle.  
This is just an example.  Each case is different.

 - are there any safety mechanisms that would get a search to abort rather
 than make the server crash with out of memory?

I don't think so.  When an app hits OOM, I think it doesn't have much control 
over its destiny.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
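
On the abort question, one hedged partial mitigation (it bounds time, not memory): TimeLimitingCollector, available in 2.9+, stops collecting once a budget is exceeded. The 1000 ms budget is illustrative, and query/searcher are your own objects:

import org.apache.lucene.search.*;

TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);
Collector collector = new TimeLimitingCollector(tdc, 1000);  // ms
try {
    searcher.search(query, collector);
} catch (TimeLimitingCollector.TimeExceededException e) {
    // search aborted; whatever was collected so far is still usable
}
TopDocs top = tdc.topDocs();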
 Simon Willnauer wrote:
  
  On Fri, Nov 13, 2009 at 11:17 AM, Ian Lea wrote:
  I got OutOfMemoryError at
  org.apache.lucene.search.Searcher.search(Searcher.java:183)
  My index is 43G bytes.  Is that too big for Lucene ?
  Luke can see the index has over 1800M docs, but the search is also out
  of memory.
  I use -Xmx1024M to specify 1G java heap space.
 
  43Gb is not too big for lucene, but it certainly isn't small and that
  is a lot of docs.  Just give it more memory.
  I would strongly recommend to give it more memory, what version of
  lucene do you use? Depending on your setup you could run into a JVM
  bug if you use a Lucene version < 2.9. Your index is big enough
  (document wise) that your norms file grows > 100MB; depending on your
  Xmx settings this could trigger a false OOM during index open. So if
  you are using < 2.9 check out this issue
  https://issues.apache.org/jira/browse/LUCENE-1566
  
  
 
  One abnormal thing is that I broke a running optimize of this index.
  Is that can be a problem ?
 
  Possibly ...
  In general, this should not be a problem. The optimize will not
  destroy the index you are optimizing as segments are write once.
 
  If so, how can I fix an index after optimize process is broken.
 
  Probably depends on what you mean by broken.  Start with running
  org.apache.lucene.index.CheckIndex.  That can also fix some things -
  but see the warning in the javadocs.
  100% recommended to make sure nothing is wrong! :)
 
 
  --
  Ian.
  
 



Re: Prefix Query for autocomplete - TooManyClauses

2009-11-13 Thread Otis Gospodnetic
Hello,

Also keep in mind prefix queries are not the cheapest.
Plug:
We've seen people use this successfully: 
http://www.sematext.com/products/autocomplete/index.html
I believe somebody is trying this out with a set of 1B suggestions.  The demo 
at http://www.sematext.com/demo/ac/index.html searches 6M Wikipedia titles with 
a a *tiny* JVM heap.

Otis




- Original Message 
 From: Anjana Sarkar anjana...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Fri, November 13, 2009 8:50:38 AM
 Subject: Prefix Query for autocomplete - TooManyClauses
 
 We are using Lucene for one of our projects here and it has been working very
 well for the last 2 years.
 The new requirement is to use it for autocomplete. Here, queries like a* or
 ab* pose a problem.
 I have set BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) to get around
 the TooManyClauses exception.
 The issue now is that performance is not acceptable. It takes about 3 secs
 for an a* query to return results.
 I have 250,000 documents , each document is 5 - 15 words in the indexed
 field and am using StandardAnalyzer. I have tried using a filter,
 since in this case, I am only interested in documents with a boost higher
 than a certain number. I had
 the boost value as a separate lucene indexed field so I can filter on it.
 I realized that the filtering is only applied after the boolean query is
 prepared and scored, so there is no performance benefit with using that
 approach.
 I cannot use a ConstantScoreQuery as I need the top n matches for the query.
 Any suggestions on how I can get around this issue will be highly
 appreciated.





Re: Lucene index write performance optimization

2009-11-10 Thread Otis Gospodnetic
This is what we have in Lucene in Action 2:

~/lia2$ ff \*Thread\*java
./src/lia/admin/CreateThreadedIndexTask.java
./src/lia/admin/ThreadedIndexWriter.java

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
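
On the simultaneous-writes question: IndexWriter is already thread-safe, so the usual pattern (and the idea behind LIA2's ThreadedIndexWriter) is one shared writer fed by a pool of worker threads, rather than per-thread shards. A hedged sketch, where docs and the pool size are placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

final IndexWriter writer = ...;  // one writer, shared by all threads
ExecutorService pool = Executors.newFixedThreadPool(4);
for (final Document doc : docs) {
    pool.execute(new Runnable() {
        public void run() {
            try {
                writer.addDocument(doc);  // safe to call concurrently
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}
pool.shutdown();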



- Original Message 
 From: Jamie Band ja...@stimulussoft.com
 To: java-user@lucene.apache.org
 Sent: Tue, November 10, 2009 11:43:30 AM
 Subject: Lucene index write performance optimization
 
 Hi There
 
 Our app spends a lot of time waiting for Lucene to finish writing to the
 index.
 I'd like to minimize this. If you have a moment to spare, please let me know
 if my LuceneIndex class presented below can be improved upon.
 
 It is used in the following way:
 
 luceneIndex = new LuceneIndex(Config.getConfig().getIndex().getIndexBacklog(),
     exitReq, volume.getID() + "indexer", volume.getIndexPath(),
     Config.getConfig().getIndex().getMaxSimultaneousDocs());
 Document doc = new Document();
 IndexInfo indexInfo = new IndexInfo(doc);
 luceneIndex.indexDocument(indexInfo);
 
 As an aside, is there any way for Lucene to support simultaneous writes to
 an index? For example, each write thread could write to a separate shard,
 and after a period the shards could be merged into a single index? Or is
 this overkill? I am interested to hear the opinion of the Lucene experts.
 
 Thanks in advance
 
 Jamie
 
 package com.stimulus.archiva.index;
 
 import java.io.File;
 import java.io.IOException;
 import java.io.PrintStream;
 import org.apache.commons.logging.*;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.*;
 import org.apache.lucene.store.FSDirectory;
 import java.util.*;
 import org.apache.lucene.store.LockObtainFailedException;
 import org.apache.lucene.store.AlreadyClosedException;
 import java.util.concurrent.locks.ReentrantLock;
 import java.util.concurrent.*;
 
 public class LuceneIndex extends Thread {
 
     protected ArrayBlockingQueue<LuceneDocument> queue;
     protected static final Log logger =
         LogFactory.getLog(LuceneIndex.class.getName());
     protected static final Log indexLog = LogFactory.getLog("indexlog");
     IndexWriter writer = null;
     protected static ScheduledExecutorService scheduler;
     protected static ScheduledFuture scheduledTask;
     protected LuceneDocument EXIT_REQ = null;
     ReentrantLock indexLock = new ReentrantLock();
     ArchivaAnalyzer analyzer = new ArchivaAnalyzer();
     File indexLogFile;
     PrintStream indexLogOut;
     IndexProcessor indexProcessor;
     String friendlyName;
     String indexPath;
     int maxSimultaneousDocs;
 
     public LuceneIndex(int queueSize, LuceneDocument exitReq,
             String friendlyName, String indexPath, int maxSimultaneousDocs) {
         this.queue = new ArrayBlockingQueue<LuceneDocument>(queueSize);
         this.EXIT_REQ = exitReq;
         this.friendlyName = friendlyName;
         this.indexPath = indexPath;
         this.maxSimultaneousDocs = maxSimultaneousDocs;
         setLog(friendlyName);
     }
 
     public int getMaxSimultaneousDocs() {
         return maxSimultaneousDocs;
     }
 
     public void setMaxSimultaneousDocs(int maxSimultaneousDocs) {
         this.maxSimultaneousDocs = maxSimultaneousDocs;
     }
 
     public ReentrantLock getIndexLock() {
         return indexLock;
     }
 
     protected void setLog(String logName) {
         try {
             indexLogFile = getIndexLogFile(logName);
             if (indexLogFile != null) {
                 if (indexLogFile.length() > 10485760)
                     indexLogFile.delete();
                 indexLogOut = new PrintStream(indexLogFile);
             }
             logger.debug("set index log file path {path='"
                 + indexLogFile.getCanonicalPath() + "'}");
         } catch (Exception e) {
             logger.error("failed to open index log file:" + e.getMessage(), e);
         }
     }
 
     protected File getIndexLogFile(String logName) {
         try {
             String logfilepath = Config.getFileSystem().getLogPath()
                 + File.separator + logName + "index.log";
             return new File(logfilepath);
         } catch (Exception e) {
             logger.error("failed to open index log file:" + e.getMessage(), e);
             return null;
         }
     }
 
     protected void openIndex() throws MessageSearchException {
         Exception lastError = null;
         if (writer == null) {
             logger.debug("openIndex() index " + friendlyName +

Re: Filtering query results based on relevance/accuracy

2009-09-22 Thread Otis Gospodnetic
Alex,

If I understand you correctly, all you have to do is either make sure that the 
query is run as a phrase query (with quotes around it), or as a term query 
where both terms are required (with a plus sign in front of each term, no space).


As for detecting score gap and such, you could do that with a custom Collector.

Otis --
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
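
To make the two query forms concrete, a minimal sketch reusing the MultiFieldQueryParser from the code quoted below (the query strings are the point; everything else is as in Alex's snippet):

// both terms required: matches "Mexican Restaurant" but not "Chinese Restaurant"
Query required = parser.parse("+mexican +restaurant");

// or as an exact phrase
Query phrase = parser.parse("\"mexican restaurant\"");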



- Original Message 
 From: Alex azli...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Monday, September 21, 2009 6:17:53 PM
 Subject: Filtering query results based on relevance/accuracy
 
 Hi,
 
 I'm a total newbie with Lucene and trying to understand how to achieve my
 (complicated) goals. So what I'm doing is still totally experimental for me
 but is probably extremely trivial for the experts on this list :)
 
 I use lucene and Hibernate Search to index locations by their name, type,
 etc ...
 The LocationType is an Object that has it's name field indexed both
 tokenized and untokenized.
 
 The following LocationType names are indexed
 Restaurant
 Mexican Restaurant
 Chinese Restaurant
 Greek Restaurant
 etc...
 
 Considering the following query  :
 
 Mexican Restaurant
 
 I systematically get all the entries as a result, most certainly because the
 Restaurant keyword is present in all of them.
 I'm trying to have a finer grained result set.
 Obviously for Mexican Restaurant I want the Mexican Restaurant entry as
 a result but NOT Chinese Restaurant nor Greek Restaurant as they are
 irrelevant. But maybe Restaurant itself should be returned with a lower
 weight/score, or maybe it shouldn't ... I'm not sure about this one.
 
 1)
 How can I do that ?
 
 Here is the code I use for querying :
 
 
 String[] typeFields = {"name", "tokenized_name"};
 Map<String, Float> boostPerField = new HashMap<String, Float>(2);
 boostPerField.put("name", (float) 4);
 boostPerField.put("tokenized_name", (float) 2);
 
 QueryParser parser = new MultiFieldQueryParser(
     typeFields,
     new StandardAnalyzer(),
     boostPerField
 );
 
 org.apache.lucene.search.Query luceneQuery;
 
 try {
     luceneQuery = parser.parse(queryString);
 } catch (ParseException e) {
     throw new RuntimeException("Unable to parse query: " +
         queryString, e);
 }
 
 
 
 
 
 I guess that there is a way to filter out results that have a score below a
 given threshold or a way to filter out results based on score gap or
 anything similar. But I have no idea on how to do this...
 
 
 What is the best way to achieve what I want?
 
 Thank you for your help !
 
 Cheers,
 
 Alex





Re: Language Detection for Analysis?

2009-08-06 Thread Otis Gospodnetic
Bradford,

If I may:

Have a look at http://www.sematext.com/products/language-identifier/index.html
And/or http://www.sematext.com/products/multilingual-indexer/index.html

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Bradford Stephens bradfordsteph...@gmail.com
 To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
 Sent: Thursday, August 6, 2009 3:46:21 PM
 Subject: Language Detection for Analysis?
 
 Hey there,
 
 We're trying to add foreign language support into our new search
 engine -- languages like Arabic, Farsi, and Urdu (that don't work with
 standard analyzers). But our data source doesn't tell us which
 languages we're actually collecting -- we just get blocks of text. Has
 anyone here worked on language detection so we can figure out what
 analyzers to use? Are there commercial solutions?
 
 Much appreciated!
 
 -- 
 http://www.roadtofailure.com -- The Fringes of Scalability, Social
 Media, and Computer Science
 



Re: How to improve search time?

2009-08-03 Thread Otis Gospodnetic
With such a large index be prepared to put it on a server with lots of RAM 
(even if you follow all the tips from the Wiki).
When reporting performance numbers, you really ought to tell us about your 
hardware, types of queries, etc.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: prashant ullegaddi prashullega...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Monday, August 3, 2009 12:33:46 AM
 Subject: How to improve search time?
 
 Hi,
 
 I have a single index of size 87GB containing around 50M documents. The best
 search time I observed for any query was 8 sec. And when the query is
 expanded with synonyms, the search takes minutes (~2-3 min). Is there a
 better way to search so that the overall search time is reduced?
 
 Thanks,
 Prashant.





Re: Lucene for dynamic data retrieval

2009-08-02 Thread Otis Gospodnetic
Hi Satish,

Lucene doesn't enforce an index schema, so each document can have a different 
set of fields.  It sounds like you need to write a custom indexer that follows 
your custom rules and creates Lucene Documents with different Fields, depending 
on what you want indexed.

You also mention searching and retrieval of data from DB.  This, too, sounds 
like a custom search application - there is nothing in Lucene that uses a 
(R)DBMS to retrieve field values.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Findsatish findsat...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Friday, July 31, 2009 7:13:47 AM
 Subject: Lucene for dynamic data retrieval
 
 
 Hi All,
 I am new to Lucene and I am working on a search application.
 
 My application needs dynamic data retrieval from the database. That means,
 based on my previous step output, I need to retrieve entries from the DB for
 the next step.
 
 For example, if my search query contains Name field entry, I need to
 retrieve the Designations from the DB that are matched with the identified
 Name in the query.
 if there is no Name identified in the query, then I
 need to retrieve ALL the Designations from the DB.
 
 In the next step, if Designation is also identified in the query, then I
 need to retrieve the Departments from the DB that are matched with this
 Designation.
 if there is no Designation identified, then I need
 to retrieve ALL the Departments from the DB.
 
 Like this, there are around 6-7 steps, all are dependent on the previous
 step output.
 
 In this scenario, I would like to know whether I can use Lucene for creating
 the index? If so, How can I use it?
 
 Any help is highly appreciated.
 
 Thanks,
 Satish



Re: most frequent term in the index

2009-07-24 Thread Otis Gospodnetic
Hello,

Here is a class you can use for that:

./contrib/miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
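
If you'd rather do it in code than run that tool, a hedged sketch: walk the terms with TermEnum and keep the top N by docFreq() in a priority queue, then sort descending. The N of 100 is illustrative:

import java.util.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

IndexReader reader = ...;  // your open index
Comparator<Object[]> byFreq = new Comparator<Object[]>() {
    public int compare(Object[] a, Object[] b) {
        return ((Integer) a[1]).compareTo((Integer) b[1]);
    }
};
PriorityQueue<Object[]> top = new PriorityQueue<Object[]>(100, byFreq);  // min-heap
TermEnum tenum = reader.terms();
while (tenum.next()) {
    top.add(new Object[] { tenum.term(), tenum.docFreq() });
    if (top.size() > 100) top.poll();  // evict the lowest-frequency entry
}
tenum.close();
List<Object[]> result = new ArrayList<Object[]>(top);
Collections.sort(result, Collections.reverseOrder(byFreq));  // descending order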



- Original Message 
 From: starz10de farag_ah...@yahoo.com
 To: java-user@lucene.apache.org
 Sent: Friday, July 24, 2009 4:54:47 PM
 Subject: most frequent term in the index
 
 
 How to get the most frequent terms in the index in descending order? 
 
 Thanks



Re: Cosine similarity

2009-07-24 Thread Otis Gospodnetic
Yes, have a look at this:
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Similarity.html

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
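
For reference, the formula documented there (Lucene's default Similarity, a TF-IDF / vector-space variant rather than a pure cosine; summarized from that javadoc, so double-check the page for your version):

score(q,d) = coord(q,d) * queryNorm(q) * SUM over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

norm(t,d) folds index-time boosts and field-length normalization together, so the document vector is not normalized exactly the way pure cosine similarity would require.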



- Original Message 
 From: starz10de farag_ah...@yahoo.com
 To: java-user@lucene.apache.org
 Sent: Friday, July 24, 2009 4:50:22 PM
 Subject: Cosine similarity
 
 
 Does Lucene use the cosine similarity measure to measure the similarity between
 the query and the indexed documents?
 
 Thanks



Re: Loading an index into memory

2009-07-23 Thread Otis Gospodnetic
I haven't verified this myself, but I remember talking to somebody who tried 
MMapDirectory and compared it to simply using tmpfs (RAM FS).  The result was 
that MMapDirectory had some memory overhead, so putting the index on tmpfs was 
more memory-efficient.  I guess this person had read-only indices, so tmpfs was 
an option.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
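
A minimal sketch of the MMapDirectory route Uwe describes below, with an assumed index path:

import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

Directory dir = new MMapDirectory(new File("/path/to/index"));
IndexSearcher searcher = new IndexSearcher(dir, true);  // read-only

The OS pages index data in and out on demand, so nothing is copied onto the Java heap up front.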



- Original Message 
 From: Uwe Schindler uschind...@pangaea.de
 To: java-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 9:47:24 AM
 Subject: RE: Loading an index into memory
 
 The size is in bytes and RAMDirectory stores the bytes as bytes, so the size
 is equal. I would suggest not copying the dir into a RAMDirectory. It is
 better to use MMapDirectory in this case, as it maps the files into the
 address space like a normal OS swap file. The OS kernel will automatically
 swap needed parts into physical RAM. In this case the Java heap is not
 wasted and only the needed parts are swapped into RAM.
 
 -
 UWE SCHINDLER
 Webserver/Middleware Development
 PANGAEA - Publishing Network for Geoscientific and Environmental Data
 MARUM - University of Bremen
 Room 2500, Leobener Str., D-28359 Bremen
 Tel.: +49 421 218 65595
 Fax:  +49 421 218 65505
 http://www.pangaea.de/
 E-mail: uschind...@pangaea.de
 
  -Original Message-
  From: Dragon Fly [mailto:dragon-fly...@hotmail.com]
  Sent: Thursday, July 23, 2009 3:38 PM
  To: java-user@lucene.apache.org
  Subject: Loading an index into memory
  
  
  Hi,
  
  I have a question regarding RAMDirectory.  I have a 5 GB index on disk and
  it is opened like the following:
  
searcher = new IndexSearcher (new RAMDirectory (indexDirectory));
  
  Approximately how much memory is needed to load the index? 5GB of memory
  or 10GB because of Unicode? Does the entire index get loaded into memory
  or only parts of it? Thank you.
  
  
 
 


