Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-07 Thread Glen Newton
Thank-you. Glen On Sat, 6 Aug 2022 at 23:46, Tomoko Uchida wrote: > Hi Glen, > I verified your Jira/GitHub usernames and added a mapping. > > https://github.com/apache/lucene-jira-archive/commit/ae78d583b40f5bafa1f8ee09854294732dbf530b > > Tomoko > > > 20

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Glen Newton
jira: gnewton github: gnewton (github.com/gnewton) Thanks, Glen On Sat, 6 Aug 2022 at 14:11, Tomoko Uchida wrote: > Hi everyone. > > I wanted to let you know that we'll extend the deadline until the date the > migration is started (the date is not fixed yet). > Please let us know your

Re: Lucene 6.3 faceting documentation

2016-11-10 Thread Glen Newton
o source is up-to-date though. > > Shai > > On Thu, Nov 10, 2016 at 4:40 PM Glen Newton <glen.new...@gmail.com> wrote: > > > I am looking for documentation on Lucene faceting. The most recent > > documentation I can find is for 4.0.0 here: > > >

Lucene 6.3 faceting documentation

2016-11-10 Thread Glen Newton
I am looking for documentation on Lucene faceting. The most recent documentation I can find is for 4.0.0 here: http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html Is there more recent documentation for 6.3.0? Or 6.x? Thanks, Glen

Re: docid is just a signed int32

2016-08-19 Thread Glen Newton
> load a single document (or a fixed number of them) for every step. In > the case you call loadAll() there is a problem with memory. > > > > > 2016-08-19 15:39 GMT+02:00, Glen Newton <glen.new...@gmail.com>: > > Making docid an int64 is a non-trivial undertaki

Re: docid is just a signed int32

2016-08-19 Thread Glen Newton
Making docid an int64 is a non-trivial undertaking, and this work needs to be compared against the use cases and how compelling they are. That said, in the lifetime of most software projects a decision is made to break backward compatibility to move the project forward. When/if moving to int64

Re: docid is just a signed int32

2016-08-18 Thread Glen Newton
Or maybe it is time Lucene re-examined this limit. There are use cases out there where >2^31 does make sense in a single index (huge number of tiny docs). Also, I think the underlying hardware and the JDK have advanced to make this more defendable. Constructively, Glen On Thu, Aug 18, 2016 at

Re: Question about JoinUtil

2014-12-17 Thread Glen Newton
of your grouped data. This is really limiting if your relationships are truly many to many. Hope that helps, Greg On Tue, Dec 16, 2014 at 10:46 AM, Glen Newton glen.new...@gmail.com wrote: Anyone? On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton glen.new...@gmail.com wrote: Is there any reason

Re: Question about JoinUtil

2014-12-16 Thread Glen Newton
Anyone? On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton glen.new...@gmail.com wrote: Is there any reason JoinUtil (below) does not have a 'Query toQuery' available? I was wanting to filter on the 'to' side as well. I feel I am missing something here. To make sure this is not an XY problem, here

Question about JoinUtil

2014-12-11 Thread Glen Newton
Is there any reason JoinUtil (below) does not have a 'Query toQuery' available? I was wanting to filter on the 'to' side as well. I feel I am missing something here. To make sure this is not an XY problem, here is my use case: I have a many-to-many relationship. The left, join, and right 'table'

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Glen Newton
Hi Koji, Semantic vectors is here: http://code.google.com/p/semanticvectors/ It is a project that has been around for a number of years and used by many people (including me http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html ). If you could compare and contrast

Re: IndexWriter croaks on large file

2014-02-14 Thread Glen Newton
You should consider making each _line_ of the log file a (Lucene) document (assuming it is a log-per-line log file) -Glen On Fri, Feb 14, 2014 at 4:12 PM, John Cecere john.cec...@oracle.com wrote: I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have

Index-time term expansion

2013-05-03 Thread Glen Newton
Hello, I know I've seen it go by on this list and elsewhere, but cannot seem to find it: can someone point me to the best way to do term expansions at indexing time? That is, when the sentence is: This foo is in my way And I have somewhere: foo=bar|yak Lucene indexes something like: This

Re: Index-time term expansion

2013-05-03 Thread Glen Newton
Thanks :-) On Fri, May 3, 2013 at 2:31 PM, Alan Woodward a...@flax.co.uk wrote: Hi Glen, You want the SynonymFilter: http://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html Alan Woodward www.flax.co.uk On 3 May 2013, at 19:14, Glen

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread Glen Newton
I am in the process of upgrading LuSql from 2.x to 4.x and I am first going to 3.6 as the jump to 4.x was too big. I would suggest this to you. I think it is less work. Of course I am also able to offer LuSql to 3.6 users, so this is slightly different from your case. -Glen On Wed, Jan 9, 2013

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time. If this could be fixed (i.e. indexing the _end_ of a span) I think all the things that I want to do, and the things that can

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
It is not clear this is exactly what is needed/being discussed. From the issue: We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. This adds it to a token, not a span. 'same position' does not

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
example of adding an annotation to text. On 12/13/2012 01:54 PM, Glen Newton wrote: It is not clear this is exactly what is needed/being discussed. From the issue: We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Glen Newton
+10 These are the kind of things you can do in GATE[1] using annotations[2]. A VERY useful feature. -Glen [1]http://gate.ac.uk [2]http://gate.ac.uk/wiki/jape-repository/annotations.html On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote: Is there any

Re: Lucene in Corpus Linguistics

2012-09-26 Thread Glen Newton
Yes, very interested. -- Quick scan: very cool work! +10 :-) Thanks, Glen Newton On Wed, Sep 26, 2012 at 9:59 AM, Carsten Schnober schno...@ids-mannheim.de wrote: Hi, in case someone is interested in an application of the Lucene indexing engine in the field of corpus linguistics rather

Re: Performance of storing data in Lucene vs other (No)SQL Databases

2012-05-18 Thread Glen Newton
Storing content in large indexes can significantly add to index time. The model of indexing fields only in Lucene and storing just a key, and then storing the content in some other container (DBMS, NoSql, etc) with the key as lookup is almost a necessity for this use case unless you have a

Re: Can I detect incorrect language selection after creating an index?

2012-02-27 Thread Glen Newton
Do the check _before_ indexing. Use https://code.google.com/p/language-detection/ to verify the language of the text document before you put it in the index. -Glen Newton http://zzzoot.blogspot.com/ On Mon, Feb 27, 2012 at 10:53 AM, Ilya Zavorin izavo...@caci.com wrote: Suppose I have a bunch

Re: Customizing indexing of large files

2012-02-27 Thread Glen Newton
I'd suggest writing a perl script or insert-favourite-scripting-language-here script to pre-filter this content out of the files before it gets to Lucene/Solr. Or you could just grep for 'Data' and 'Description' (or is 'Description' multi-line?) -Glen Newton On Mon, Feb 27, 2012 at 11:55 AM, Prakash

Re: Customizing indexing of large files

2012-02-27 Thread Glen Newton
what to do in it. Regards, Prakash Bande Director - Hyperworks Enterprise Software Altair Eng. Inc. Troy MI Ph: 248-614-2400 ext 489 Cell: 248-404-0292 -Original Message- From: Glen Newton [mailto:glen.new...@gmail.com] Sent: Monday, February 27, 2012 12:05 PM To: java-user

Re: Castle for Lucene/Solr?

2011-09-04 Thread Glen Newton
Caste -- Castle https://bitbucket.org/acunu http://support.acunu.com/entries/20216797-castle-build-instructions It looks very promising. It is a kernel module and I'm not sure it can run in user space, which I'd prefer. -Glen Newton On Sat, Sep 3, 2011 at 9:21 PM, Otis Gospodnetic

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-15 Thread Glen Newton
and http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm Finally (or before doing all of this! :-) ), do some profiling, both inside of Java, and of the AIX native heap using svmon (see Native Heap Exhaustion, p.135). -Glen Newton

Re: Index one huge text file

2011-07-22 Thread Glen Newton
Could you elaborate what you want to do with the index of large documents? Do you want to search at the document or sentence level? This can drive how to index this content. -Glen On Fri, Jul 22, 2011 at 10:52 AM, starz10de farag_ah...@yahoo.com wrote: Hi, I have one text file that contains

Re: Index one huge text file

2011-07-22 Thread Glen Newton
So to use Lucene-speak, each sentence is a document. I don't know how you are indexing and what code you are using (and what hardware, etc.), but if you are not already, you should consider multi-threading the indexing, which should give you a significant indexing performance boost. -Glen On

Re: Lucene Architecture Site (Prototype)

2011-07-07 Thread Glen Newton
gmail interprets the closing asterisk as part of the URL, for all three URLs -- 404s You might want to add a space before the '*'... -glen On Thu, Jul 7, 2011 at 2:17 PM, Abhishek Rakshit abhis...@architexa.com wrote: Hey folks, We received great feedback on the Lucene Architecture site that

Re: Lucene on Multi-Processor/Core machines

2011-01-25 Thread Glen Newton
-threaded-query-lucene.html http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html Glen Newton On Tue, Jan 25, 2011 at 11:31 AM, Siraj Haider si...@jobdiva.com wrote: Hello there, I was looking for best practices for indexing/searching on a multi-processor/core machine

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Glen Newton
Where do you get your Lucene/Solr downloads from? [x] ASF Mirrors (linked in our release announcements or via the Lucene website) [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [] I/we build them from source via an SVN/Git checkout. -Glen Newton

Re: Dataimport performance

2010-12-16 Thread Glen Newton
me. Thanks, Glen Newton http://zzzoot.blogspot.com -- Old LuSql benchmarks: http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James james.d...@ingrambook.com wrote: We have ~50 long-running SQL queries that need to be joined

IndexTank technology...

2010-11-11 Thread Glen Newton
Does anyone know what technology they are using: http://www.indextank.com/ Is it Lucene under the hood? Thanks, and apologies for cross-posting. -Glen http://zzzoot.blogspot.com -- - - To unsubscribe, e-mail:

Re: lucene usage on TREC data

2010-08-14 Thread Glen Newton
the ClueWeb collection http://trec.nist.gov/pubs/trec18/papers/arsc.WEB.pdf Expanding Queries Using Multiple Resources http://staff.science.uva.nl/~mdr/Publications/Files/trec2006-proceedings-genomics.pdf -Glen Newton http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html http

Re: Using categories with Lucene

2010-08-10 Thread Glen Newton
Hi Luan, Could you tell us the name and/or URL of this plugin so that the list might know about it? Thanks, Glen On 10 August 2010 12:21, Luan Cestari luan.cest...@gmail.com wrote: We would like to say thanks for the replies. We found a plugin in Nutch (the Creative Commons plugin) that

Re: Databases

2010-07-23 Thread Glen Newton
, in a Solr context. http://wiki.apache.org/solr/DataImportHandler Thanks, -Glen Newton LuSql author http://zzzoot.blogspot.com/ On 23 July 2010 15:46, manjula wijewickrema manjul...@gmail.com wrote: Hi, Normally, when I am building my index directory for indexed documents, I used to keep my

Re: Best practices for searcher memory usage?

2010-07-14 Thread Glen Newton
There are a number of strategies, on the Java or OS side of things: - Use huge pages[1]. Esp on 64 bit and lots of ram. For long running, large memory (and GC busy) applications, this has achieved significant improvements. Like 300% on EJBs. See [2],[3],[4]. For a great article introducing and

Re: If you could have one feature in Lucene...

2010-02-27 Thread Glen Newton
Pluggable compression allowing for alternatives to gzip for text compression for storing. Specifically I am interested in bzip2[1] as implemented in Apache Commons Compress[2]. While bzip2 compression is considerably slower than gzip (although decompression is not too much slower than gzip) it

Re: If you could have one feature in Lucene...

2010-02-27 Thread Glen Newton
Hello Uwe. That will teach me for not keeping up with the versions! :-) So it is up to the application to keep track of what it used for compression. Understandable. Thanks! Glen On 27 February 2010 10:17, Uwe Schindler u...@thetaphi.de wrote: Hi Glen, Pluggable compression allowing for

Re: Exception while adding document in 3.0

2010-02-02 Thread Glen Newton
Documents cannot be re-used in v3.0? http://wiki.apache.org/lucene-java/ImproveIndexingSpeed -glen http://zzzoot.blogspot.com/ On 2 February 2010 02:55, Simon Willnauer simon.willna...@googlemail.com wrote: Ganesh, do you reuse your Document instances in any way or do you create new docs

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-18 Thread Glen Newton
, say when looking at their index with Luke. :)  Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Glen Newton glen.new...@gmail.com To: java-user@lucene.apache.org Sent: Tue

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
Could someone send me where the rationale for the removal of COMPRESSED fields is? I've looked at http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior but it is a little light on the 'why' of this change. My fault - of course -

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
/LUCENE-652 https://issues.apache.org/jira/browse/LUCENE-1960 Glen Newton wrote: Could someone send me where the rationale for the removal of COMPRESSED fields is? I've looked at http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0

Re: Lucene index write performance optimization

2009-11-10 Thread Glen Newton
You might try re-implementing, using ThreadPoolExecutor http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html glen 2009/11/10 Jamie Band ja...@stimulussoft.com: Hi There Our app spends a lot of time waiting for Lucene to finish writing to the index. I'd like to
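The ThreadPoolExecutor suggestion above can be sketched roughly as follows. This is only an illustration using the JDK: the class name PoolIndexingDemo, the queue size of 100, and the counter that stands in for "add a document to the index" are all assumptions, not anything from the original thread. The bounded queue plus CallerRunsPolicy is one possible way to keep the producer from racing far ahead of the indexing threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the counter stands in for real indexing work.
class PoolIndexingDemo {

    // Submits `docs` tasks to a fixed-size pool and returns how many completed.
    static int indexAll(int docs, int threads) {
        AtomicInteger indexed = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads,                 // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),    // bounded queue (size is an assumption)
                new ThreadPoolExecutor.CallerRunsPolicy()); // when full, the submitter runs the task
        for (int i = 0; i < docs; i++) {
            pool.execute(() -> indexed.incrementAndGet()); // stand-in for "index one document"
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        System.out.println(indexAll(10_000, 4));
    }
}
```

Waiting on awaitTermination before reading the count matters: any final step on the index should only happen once every submitted task has completed.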

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
Disclosure: I am the author of LuSql. -Glen Newton http://zzzoot.blogspot.com/ http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Glen_Newton 2009/10/22 Paul Taylor paul_t...@fastmail.fm: I'm building a lucene index from a database, creating 1 about 1 million documents

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
This is basically what LuSql does. The speedups (8h to 30 min) are similar, usually about an order of magnitude. Oh, the comments suggesting most of the interaction is with the database? The answer is: it depends. With large Lucene documents: Lucene is the limiting factor

Re: Field with reader limitation arbitrary

2009-09-15 Thread Glen Newton
and/or tests if you have them. Cheers, Anthony On Mon, Sep 14, 2009 at 1:03 PM, Glen Newton glen.new...@gmail.com wrote: Hi, In 2.4.1, Field has 2 constructors that involve a Reader: public Field(String name,                  Reader reader) public Field(String name,                  Reader

Re: Field with reader limitation arbitrary

2009-09-15 Thread Glen Newton
I appreciate your explanation, but I think that the use case I described merits a deeper exploration: Scenario 1: 16 threads indexing; queue size = 1000; present api; need to store In this scenario, there are always 1000 Strings with all the contents of their respective files. Averaging 50k per

Field with reader limitation arbitrary

2009-09-14 Thread Glen Newton
, Reader reader, Field.Store store, Field.Index index, Field.TermVector termVector) Constructively, Glen Newton http://zzzoot.blogspot.com/ http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
In this project: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html I concatenate all the text of all of articles of a single journal into a single text file. This can create a text file that is 500MB in size. Lucene is OK in indexing files this size (in parallel even),

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
@lucene.apache.org] On Behalf Of Glen Newton Sent: Friday, September 11, 2009 9:53 AM To: java-user@lucene.apache.org Subject: Re: Indexing large files? - No answers yet... In this project: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html I concatenate all

Re: [EASY]How to change the demo of lucene143 into a multithread one?

2009-08-13 Thread Glen Newton
You are optimizing before the threads are finished adding to the index. I think this should work: IndexWriter writer = new IndexWriter("D:\\index", new StandardAnalyzer(), true); File file=new File(args[0]); Thread t1=new Thread(new IndexFiles(writer,file)); Thread t2=new Thread(new
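Since the reply above is cut off, here is a minimal sketch of the ordering it describes: start the indexing threads, join them, and only then run the final step. It uses only the JDK; the class JoinBeforeFinishDemo and the event log that stands in for the IndexWriter are hypothetical, and the strings "add", "optimize" and "close" are just labels, not Lucene calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo: a synchronized event log stands in for the shared writer.
class JoinBeforeFinishDemo {

    // Runs two indexing threads, waits for both, then records the final steps.
    static List<String> run(int docsPerThread) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Runnable indexer = () -> {
            for (int i = 0; i < docsPerThread; i++) {
                log.add("add"); // stand-in for adding one document
            }
        };
        Thread t1 = new Thread(indexer);
        Thread t2 = new Thread(indexer);
        t1.start();
        t2.start();
        try {
            // Block until both threads have finished adding documents.
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Only now is it safe to run the final steps.
        log.add("optimize");
        log.add("close");
        return log;
    }

    public static void main(String[] args) {
        List<String> log = run(3);
        System.out.println(log);
    }
}
```

Without the two join() calls, the main thread could reach the final step while the workers are still adding documents, which is the problem described in the reply.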

Visualizing Semantic Journal Space (large scale) using full-text

2009-07-29 Thread Glen Newton
only the full-text (no metadata). For more info howto: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html Glen Newton -- - - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

Re: New tool: LSql

2009-04-14 Thread Glen Newton
see you include Lucene v2.3 in your code...does it work correctly with indexes created on v2.4 as well? - Greg On Mon, Apr 13, 2009 at 6:49 PM, Glen Newton glen.new...@gmail.com wrote: As the creator of LuSql [http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] I would have

Re: Can I run Lucene in google app engine?

2009-04-13 Thread Glen Newton
Another solution is to have your application on the AppEngine, but the index is on another machine. Then the application 'proxies' the requests to the machine that has the index, which is using Solr [http://lucene.apache.org/solr/] or some other way to expose the index to the web. Yes, this

Re: New tool: LSql

2009-04-13 Thread Glen Newton
As the creator of LuSql [http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] I would have hoped for a more creative (and more different) name. :-) -glen 2009/4/13 jonathan esposito jonathan.e...@gmail.com: I created a command-line tool in Java that allows the user to execute

Re: LuSQL download link error?

2009-04-02 Thread Glen Newton
Dear Shashi, It should work now. A temporary failure: our apologies. thanks, Glen 2009/4/2 Shashi Kant sk...@sloan.mit.edu: Hi all, I have been trying to get the latest version of LuSQL from the NRC.ca website but get 404s on the download links. I have written to the webmaster, but anyone

Re: People you might know ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Glen Newton
You might try looking in a list that talks about recommender systems. Google hits: - http://en.wikipedia.org/wiki/Recommendation_system - ACM Recommender Systems 2009 http://recsys.acm.org/ - A Guide to Recommender Systems http://www.readwriteweb.com/archives/recommender_systems.php 2009/3/17

Re: Merging database index with fulltext index

2009-03-01 Thread Glen Newton
I would suggest you try LuSql, which was designed specifically to index relational databases into Lucene. It has an extensive user manual/tutorial which has some complex examples involving multi-joins and sub-queries. I am the author of LuSql. LuSql home page:

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-19 Thread Glen Newton
InfoSystems (P) Ltd http://www.mapmyindia.com Glen Newton wrote: Could you give some configuration details: - Solaris version - Java VM version, heap size, and any other flags - disk setup You should also consider using huge pages (see http://zzzoot.blogspot.com/2009/02/java-mysql-increased

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Glen Newton
Could you give some configuration details: - Solaris version - Java VM version, heap size, and any other flags - disk setup You should also consider using huge pages (see http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html) I will also be posting performance gains using

Re: Visualization

2009-02-12 Thread Glen Newton
V1 of a project of mine, Ungava[1], which uses Lucene to index research articles and library catalog metadata, also uses Project Simile's Metaphor and Timeline. I have some simple examples using them: Here is the search for cell in articles:

Re: [ANN] Lucid Imagination

2009-01-26 Thread Glen Newton
Congrats and good luck on this new endeavour! -Glen :-) 2009/1/26 Grant Ingersoll gsing...@apache.org: Hi Lucene and Solr users, As some of you may know, Yonik, Erik, Sami, Mark and I teamed up with Marc Krellenstein to create a company to provide commercial support (with SLAs), training,

Re: clustering with compass terracotta

2009-01-15 Thread Glen Newton
There is a discussion here: http://www.terracotta.org/web/display/orgsite/Lucene+Integration Also of interest: Katta - distribute lucene indexes in a grid http://katta.wiki.sourceforge.net/ -glen http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html

Re: Help with installing Lucene

2009-01-07 Thread Glen Newton
I'm not sure if it's a better idea to use something like Solr or start from scratch and customize the application as I move forward. What do you think? LuSql might be appropriate for your needs: LuSql is a high-performance, simple tool for indexing data held in a DBMS into a Lucene index. It can

Re: FastSSFuzzy for faster fuzzy queries in Lucene

2009-01-06 Thread Glen Newton
- Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/ - Paper: Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/ifi-2007.02.pdf - FastSimilarSearch.java http://fastss.csg.uzh.ch/FastSimilarSearch.java - Paper: Fast Similarity Search in Peer-to-Peer

Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
From what I understand: faceted browse is a taxonomy of depth =1 A taxonomy in general has an arbitrary depth: Example: Biological taxonomy: Kingdom Animalia Phylum Acanthocephala Class Archiacanthocephala Phylum Annelida Kingdom Fungi Phylum Ascomycota Class Ascomycetes

Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
I don't think this is an Open Source project: I couldn't find any source on the site and the only download is a jar with .class files... -glen 2008/12/10 John Wang [EMAIL PROTECTED]: www.browseengine.com -John On Wed, Dec 10, 2008 at 10:55 AM, Glen Newton [EMAIL PROTECTED] wrote: From what

Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
Oops. Thanks! :-) 2008/12/10 Gary Moore [EMAIL PROTECTED]: svn co https://bobo-browse.svn.sourceforge.net/svnroot/bobo-browse/trunk bobo-browse -Gary Glen Newton wrote: I don't think this is an Open Source project: I couldn't find any source on the site and the only download is a jar

Re: NIOFSDirectory

2008-12-05 Thread Glen Newton
want concurrent writes. -John On Thu, Dec 4, 2008 at 2:44 PM, Glen Newton [EMAIL PROTECTED] wrote: Am I missing something here? Why not use: IndexWriter writer = new IndexWriter(NIOFSDirectory.getDirectory(new File(filename)), analyzer, true); Another question: is NIOFSDirectory

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
Sorry, what version are we talking about? :-) thanks, Glen 2008/12/4 Yonik Seeley [EMAIL PROTECTED]: On Thu, Dec 4, 2008 at 4:11 PM, John Wang [EMAIL PROTECTED] wrote: Hi guys: We did some profiling and benchmarking: The thread contention on FSDirectory is gone, and for the set of

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
at 4:32 PM, Glen Newton [EMAIL PROTECTED] wrote: Sorry, what version are we talking about? :-) The current development version of Lucene allows you to directly instantiate FSDirectory subclasses. -Yonik thanks, Glen 2008/12/4 Yonik Seeley [EMAIL PROTECTED]: On Thu

Re: lucene nicking my memory ?

2008-12-03 Thread Glen Newton
Hi Magnus, Could you post the OS, version, RAM size, swapsize, Java VM version, hardware, #cores, VM command line parameters, etc? This can be very relevant. Have you tried other garbage collectors and/or tuning as described in http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html?

Merging indexes multicore/multithreading

2008-12-02 Thread Glen Newton
Let's say I have 8 indexes on a 4 core system and I want to merge them (inside a single vm instance). Is it better to do a single merge of all 8, or to in parallel threads merge in pairs, until there is only a single index left? I guess the question involves how multi-threaded merging is and if it

Lucene 2.3.1 vs 2.4 benchmarks using LuSql

2008-11-24 Thread Glen Newton
I have some simple indexing benchmarks comparing Lucene 2.3.1 with 2.4: http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html In the next couple of days I will be running benchmarks comparing Solr's DataImportHandler/JdbcDataSource indexing performance with LuSql and

Software Announcement: LuSql: Database to Lucene indexing

2008-11-17 Thread Glen Newton
an 86GB Lucene index in ~13 hours. http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql Glen Newton

Global Field question (thread-safe)?

2008-11-06 Thread Glen Newton
I have a use case where I want all of my documents to have - in addition to their other fields - a single field=value. An example use is where I have multiple Lucene indexes that I search in parallel, but still need to distinguish them. Index 1: All documents have: source=a1 Index 2: All

Re: Global Field question (thread-safe)?

2008-11-06 Thread Glen Newton
Thanks! :-) 2008/11/6 Michael McCandless [EMAIL PROTECTED]: The field never changes across all docs? If so, this will work fine. Mike Glen Newton wrote: I have a use case where I want all of my documents to have - in addition to their other fields - a single field=value. An example

Document thread safe?

2008-10-31 Thread Glen Newton
Hello, I am using Lucene 2.3.1. I have concurrent threads adding Fields to the same Document, but am getting some odd behaviour. Before going into too much depth, is Document thread-safe? thanks, Glen http://zzzoot.blogspot.com/

Re: Document thread safe?

2008-10-31 Thread Glen Newton
Yes, the problem goes away when I do the following: synchronized(doc) { doc.add(field); } Thanks. [I'll use a Lock to do this properly] -glen 2008/10/31 Yonik Seeley [EMAIL PROTECTED]: On Fri, Oct 31, 2008 at 11:53 AM, Glen Newton [EMAIL PROTECTED] wrote: I have concurrent threads adding
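The "use a Lock" follow-up mentioned above might look something like the sketch below. It is not the code from the thread: LockedDocDemo is a hypothetical stand-in for a Document (a plain list of field strings), and java.util.concurrent.locks.ReentrantLock is simply one way to serialize the adds that the synchronized block was doing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a document whose field list is not thread-safe by itself.
class LockedDocDemo {
    private final List<String> fields = new ArrayList<>();
    private final Lock lock = new ReentrantLock();

    // Every add goes through the same lock, so concurrent callers are serialized.
    void add(String field) {
        lock.lock();
        try {
            fields.add(field);
        } finally {
            lock.unlock(); // always release, even if add() throws
        }
    }

    int size() {
        lock.lock();
        try {
            return fields.size();
        } finally {
            lock.unlock();
        }
    }

    // Starts `threads` workers that each add `perThread` fields; returns the final count.
    static int fillConcurrently(int threads, int perThread) {
        LockedDocDemo doc = new LockedDocDemo();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread w = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    doc.add("field-" + id + "-" + i);
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return doc.size();
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(8, 1000));
    }
}
```

With the lock in place no adds are lost, so the final count equals threads times perThread; an unguarded ArrayList shared across threads gives no such guarantee.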

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
You might want to look at my indexing of 6.4 million PDF articles, full-text and metadata. It resulted in an 83GB index taking 20.5 hours to run. It uses multiple writers, is massively multithreaded. More info here: http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Mark Miller [EMAIL PROTECTED]: It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty efficient, and there isn't much need to index to

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Michael McCandless [EMAIL PROTECTED]: Mark Miller wrote: Glen Newton wrote: 2008/10/23 Mark Miller [EMAIL PROTECTED]: It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I

Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
Sorry, could you explain what you mean by a link map over lucene results? thanks, -glen 2008/10/16 Darren Govoni [EMAIL PROTECTED]: Hi, Has anyone created a link map over lucene results or know of a link describing the process? If not, I would like to build one to contribute. Also, I read

Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
vectors in Lucene, but I've never used TFV's before. And can then be limited to just a set of results. HTH, Darren On Thu, 2008-10-16 at 14:09 -0400, Glen Newton wrote: Sorry, could you explain what you mean by a link map over lucene results? thanks, -glen 2008/10/16 Darren Govoni [EMAIL

Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
See also: http://zzzoot.blogspot.com/2007/10/drill-clouds-for-search-refinement-id.html and http://zzzoot.blogspot.com/2007/10/tag-cloud-inspired-html-select-lists.html -glen 2008/10/16 Glen Newton [EMAIL PROTECTED]: Yes, tag clouds. I've implemented them using Lucene here for NRC Research

Re: Indexing Scalability, Multiwriter?

2008-10-10 Thread Glen Newton
IndexWriter is thread-safe and has been for a while (http://www.mail-archive.com/[EMAIL PROTECTED]/msg00157.html) so you don't have to worry about that. As reported in my blog in April (http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html) but perhaps not explicitly

Re: could I implement this scenario?

2008-09-19 Thread Glen Newton
I think it is not good idea to use lucene as storage, it is just index. I strongly disagree with this position. To qualify my disagreement: yes, you should not use Lucene as your primary storage for your data in your organization. But, for a particular application, taking content from your

Re: Tree search

2008-08-07 Thread Glen Newton
There are a number of ways to do this. Here is one: Lose the parentid field (unless you have other reasons to keep it). Add a field fullName, and a field called depth : doc1 fullName: state depth: 0 doc2 fullName: state/department depth:1 doc3 fullName: state/department/Boston depth: 2 doc4
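As a rough illustration of the fullName/depth idea above, the sketch below derives both values from parent links. The message is cut off before it says how the fields are queried, so the prefix test in isDescendant is an assumption about how a subtree lookup could work, and the class TreePathDemo and its sample tree are hypothetical. Only the JDK is used; no Lucene calls are shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that derives the fullName and depth values for each node.
class TreePathDemo {

    // Walks parent links up to the root and joins the names with '/'.
    // A node with no entry in parentOf is treated as a root.
    static String fullName(String node, Map<String, String> parentOf) {
        String parent = parentOf.get(node);
        return parent == null ? node : fullName(parent, parentOf) + "/" + node;
    }

    // Depth is the number of '/' separators: "state" is 0, "state/department" is 1.
    static int depth(String fullName) {
        int d = 0;
        for (char c : fullName.toCharArray()) {
            if (c == '/') {
                d++;
            }
        }
        return d;
    }

    // Assumed lookup rule: a node is under an ancestor if its path starts with "ancestor/".
    static boolean isDescendant(String candidate, String ancestor) {
        return candidate.startsWith(ancestor + "/");
    }

    // Sample tree matching the names used in the message above.
    static Map<String, String> sampleTree() {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("department", "state");
        parentOf.put("Boston", "department");
        return parentOf;
    }

    public static void main(String[] args) {
        String path = fullName("Boston", sampleTree());
        System.out.println(path + " depth: " + depth(path));
    }
}
```

Storing the full path means a subtree can be selected by prefix and a single level by depth, without walking parent ids at query time.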

Re: Scaling

2008-07-16 Thread Glen Newton
A subset of your questions are answered (or at least examined) in my postings on multi-thread queries on a multiple-core single system: http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html -Glen

Re: How to make documents clustering and topic classification with lucene

2008-07-08 Thread Glen Newton
Use Carrot2: http://project.carrot2.org/ For Lucene + Carrot2: http://project.carrot2.org/faq.html#lucene-integration -glen 2008/7/7 Ariel [EMAIL PROTECTED]: Hi everybody: Do you have any idea how to do document clustering and topic classification using Lucene? Is there

Re: Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-13 Thread Glen Newton
Lutan, Yes, no problem. I am away at a conference next week but plan to release the code the following week. Is this OK for you? thanks, Glen 2008/6/13 lutan [EMAIL PROTECTED]: TO: Glen Newton Could I get your test code or code architecture for study. I have try to using

Re: Concurrent query benchmarks

2008-06-10 Thread Glen Newton
work and I will clean things up a bit, write a little documentation. -Glen Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Glen Newton [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, June 10, 2008 12:51:41 AM Subject

Re: Concurrent query benchmarks

2008-06-10 Thread Glen Newton
/index.php?title=Create_Lucene_Database_Search_in_3_minutes DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding! On Mon, Jun 9, 2008 at 3:51 PM, Glen Newton [EMAIL PROTECTED] wrote: A number of people have asked about query benchmarks. I have

Multi-language support within a single index

2008-06-05 Thread Glen Newton
? :-) Is this something that is already being talked about/looked in to/being implemented? :-) thanks, Glen Newton http://zzzoot.blogspot.com/

Re: Multi-language support within a single index

2008-06-05 Thread Glen Newton
indexing and querying out-of-the-box. Best Erick On Thu, Jun 5, 2008 at 12:14 PM, Glen Newton [EMAIL PROTECTED] wrote: I would like to be able to get multi-language support within a single index. I would appreciate input on what I am suggesting: Assuming that you want something like

Re: How to add PageRank score with lucene's relevant score in sorting

2008-05-28 Thread Glen Newton
You should consider keeping the PageRank (and any other more dynamic data) in a separate index (with the documents in the same order as your bigger, more static index) and then use a ParallelReader on both of them. See:

Re: Improving search performance

2008-05-22 Thread Glen Newton
2008/5/22 Otis Gospodnetic [EMAIL PROTECTED]: Some quick feedback. Those are all very expensive queries (wildcards and ranges). The first thing I'd do is try without Hibernate Search (to make sure HS is not the bottleneck). 100 threads is a lot, I'm guessing you are reusing your

Re: Lucene Indexing structure

2008-05-02 Thread Glen Newton
Vaijanath, I think I would do things in a different fashion: Lucene's default distance metric is based on tf/idf and the cosine model, i.e. the frequencies of items. I believe the values that you are adding as Fields are the values in n-space for each of these image-based attributes. I don't

Re: Does Lucene Supports Billions of data

2008-04-30 Thread Glen Newton
I have created Indexes with 1.5 billion documents. It was experimental: I took an index with 25 million documents, and merged it with itself many times. While not definitive as there were only 25m unique documents that were duplicated, it did prove that Lucene should be able to handle this number
