Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-07 Thread Glen Newton
Thank-you.

Glen

On Sat, 6 Aug 2022 at 23:46, Tomoko Uchida 
wrote:

> Hi Glen,
> I verified your Jira/GitHub usernames and added a mapping.
>
> https://github.com/apache/lucene-jira-archive/commit/ae78d583b40f5bafa1f8ee09854294732dbf530b
>
> Tomoko
>
>
> On Sun, Aug 7, 2022 at 3:37 Glen Newton wrote:
>
> > jira: gnewton
> > github: gnewton  (github.com/gnewton)
> >
> > Thanks,
> > Glen
> >
> >
> >
> > > On Sat, 6 Aug 2022 at 14:11, Tomoko Uchida wrote:
> >
> > > Hi everyone.
> > >
> > > I wanted to let you know that we'll extend the deadline until the date
> > > the migration is started (the date is not fixed yet).
> > > Please let us know your Jira/Github usernames if you don't see
> > > mapping(s) for your account in this file:
> > >
> > > https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
> > >
> > > Tomoko
> > >
> > >
> > > On Sun, Aug 7, 2022 at 1:36 Baris Kazar wrote:
> > >
> > > > Thank You Thank You
> > > > Best regards
> > > > 
> > > > From: Michael McCandless 
> > > > Sent: Saturday, August 6, 2022 11:29:25 AM
> > > > To: Baris Kazar 
> > > > Cc: java-user@lucene.apache.org 
> > > > Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> > > > before Thursday August 4 midnight (in your local time)
> > > >
> > > > OK done:
> > > >
> > > > https://github.com/apache/lucene-jira-archive/commit/13fa4cb46a1a6d609448240e4f66c263da8b3fd1
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > > On Sat, Aug 6, 2022 at 10:29 AM Baris Kazar <baris.ka...@oracle.com> wrote:
> > > > I think so.
> > > > Best regards
> > > > 
> > > > From: Michael McCandless <luc...@mikemccandless.com>
> > > > Sent: Saturday, August 6, 2022 10:12 AM
> > > > To: java-user@lucene.apache.org
> > > > Cc: Baris Kazar <baris.ka...@oracle.com>
> > > > Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> > > > before Thursday August 4 midnight (in your local time)
> > > >
> > > > Thanks Baris,
> > > >
> > > > And your Jira ID is bkazar right?
> > > >
> > > > Mike
> > > >
> > > > On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar <baris.ka...@oracle.com> wrote:
> > > > My github username is bmkazar
> > > > can You please register me?
> > > > Best regards
> > > > 
> > > > From: Michael McCandless <luc...@mikemccandless.com>
> > > > Sent: Saturday, August 6, 2022 6:05:51 AM
> > > > To: d...@lucene.apache.org
> > > > Cc: Lucene Users <java-user@lucene.apache.org>; java-dev <java-...@lucene.apache.org>
> > > > Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> > > > before Thursday August 4 midnight (in your local time)
> > > >
> > > > Hi Adam, I added your linked accounts here:
> > > >
> > > > https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Glen Newton
jira: gnewton
github: gnewton  (github.com/gnewton)

Thanks,
Glen



On Sat, 6 Aug 2022 at 14:11, Tomoko Uchida 
wrote:

> Hi everyone.
>
> I wanted to let you know that we'll extend the deadline until the date the
> migration is started (the date is not fixed yet).
> Please let us know your Jira/Github usernames if you don't see mapping(s)
> for your account in this file:
>
> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>
> Tomoko
>
>
> On Sun, Aug 7, 2022 at 1:36 Baris Kazar wrote:
>
> > Thank You Thank You
> > Best regards
> > 
> > From: Michael McCandless 
> > Sent: Saturday, August 6, 2022 11:29:25 AM
> > To: Baris Kazar 
> > Cc: java-user@lucene.apache.org 
> > Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> > before Thursday August 4 midnight (in your local time)
> >
> > OK done:
> >
> > https://github.com/apache/lucene-jira-archive/commit/13fa4cb46a1a6d609448240e4f66c263da8b3fd1
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Sat, Aug 6, 2022 at 10:29 AM Baris Kazar <baris.ka...@oracle.com> wrote:
> > I think so.
> > Best regards
> > 
> > From: Michael McCandless <luc...@mikemccandless.com>
> > Sent: Saturday, August 6, 2022 10:12 AM
> > To: java-user@lucene.apache.org
> > Cc: Baris Kazar <baris.ka...@oracle.com>
> > Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> > before Thursday August 4 midnight (in your local time)
> >
> > Thanks Baris,
> >
> > And your Jira ID is bkazar right?
> >
> > Mike
> >
> > On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar <baris.ka...@oracle.com> wrote:
> > My github username is bmkazar
> > can You please register me?
> > Best regards
> > 
> > From: Michael McCandless <luc...@mikemccandless.com>
> > Sent: Saturday, August 6, 2022 6:05:51 AM
> > To: d...@lucene.apache.org
> > Cc: Lucene Users <java-user@lucene.apache.org>; java-dev <java-...@lucene.apache.org>
> > Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> > before Thursday August 4 midnight (in your local time)
> >
> > Hi Adam, I added your linked accounts here:
> >
> > https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c
> >
> > And Tomoko added Rushabh's linked accounts here:
> >
> > https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb
> >
> > Keep the linked accounts coming!
> >
> > Mike
> >
> > On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
> > <rushabh.s...@salesforce.com.invalid> wrote:
> >
> > > Hi,
> > > My mapping is:
> > > JiraName,GitHubAccount,JiraDispName
> > > shahrs87, shahrs87, Rushabh Shah
> > >
> > > Thank you Tomoko and Mike for all of your hard work.
> > >
> > >
> > >
> > >
> > > On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > >> Hello Lucene users, contributors and developers,
> > >>
> > >> If you have used Lucene's Jira and you have a GitHub account as well,
> > >> please check whether your user id mapping is in this file:
> > >>
> > >> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
> > >>
> > >> If not, please reply to this email and we will try to add you.
> > >>
> > >> Please forward this email to anyone you know might be impacted and who
> > >> might not be tracking the Lucene lists.
> > >>
> > >>
> > >> Full details:
> > >>
> > >> The Lucene project will soon migrate from Jira to GitHub for issue
> > >> tracking.
> > >>
> > >> There have been discussions, votes, a migration tool created / iterated
> > >> (thanks to Tomoko Uchida's incredibly hard work), all iterating on
> > >> Lucene's dev list.
> > >>
> > >> When we 

Re: Lucene 6.3 faceting documentation

2016-11-10 Thread Glen Newton
Great! Thanks so much!  :-)

Glen

On Thu, Nov 10, 2016 at 9:47 AM, Shai Erera  wrote:

> We've removed the userguide a long time ago. We have a set of example files
> under lucene-demo, e.g. here:
> https://lucene.apache.org/core/6_3_0/demo/src-html/org/apache/lucene/demo/facet/
>
> Also, you can read some blog posts, start here:
> http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html and then
> http://shaierera.blogspot.com/2012/11/lucene-facets-part-2.html, though the
> code examples may be outdated. The lucene-demo source is up-to-date though.
>
> Shai
>
> On Thu, Nov 10, 2016 at 4:40 PM Glen Newton  wrote:
>
> > I am looking for documentation on Lucene faceting. The most recent
> > documentation I can find is for 4.0.0 here:
> >
> > http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
> >
> > Is there more recent documentation for 6.3.0? Or 6.x?
> >
> > Thanks,
> > Glen
> >
>
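
For reference, a minimal taxonomy-facets sketch in the spirit of the
lucene-demo SimpleFacetsExample linked above (6.x API; the RAMDirectory and
the "Author" field are for illustration only):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class FacetSketch {
  public static void main(String[] args) throws Exception {
    Directory indexDir = new RAMDirectory();  // in-memory, for the sketch only
    Directory taxoDir = new RAMDirectory();   // the sidecar taxonomy index

    IndexWriter iw =
        new IndexWriter(indexDir, new IndexWriterConfig(new StandardAnalyzer()));
    DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
    FacetsConfig config = new FacetsConfig();

    Document doc = new Document();
    doc.add(new FacetField("Author", "Bob"));
    iw.addDocument(config.build(taxoWriter, doc));  // build() rewrites facet fields
    iw.close();
    taxoWriter.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));
    DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    FacetResult result = facets.getTopChildren(10, "Author");
    System.out.println(result);
  }
}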


Lucene 6.3 faceting documentation

2016-11-10 Thread Glen Newton
I am looking for documentation on Lucene faceting. The most recent
documentation I can find is for 4.0.0 here:
http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html

Is there more recent documentation for 6.3.0? Or 6.x?

Thanks,
Glen


Re: docid is just a signed int32

2016-08-19 Thread Glen Newton
I was referring to memory (RAM).

We have machines running right now with 1TB _RAM_ and will be getting
machines with 3TB RAM (Dell R830 with 48 64GB DIMMs). (Sorry, I was
incorrect when I said we were running the 3TB machines _now_).

Glen



On Fri, Aug 19, 2016 at 9:56 AM, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:

> ah :)
>
> "with 3TB of ram (we have these running), int64 for >2^32 documents in a
> single index should not be a problem"
>
> Maybe I'm reasoning in a bad way, but normally the size of storage is not
> the size of memory.
> I don't know Lucene in depth, but I would expect a Lucene index to be
> scanned block by block, not held all in memory. For this reason, in a
> previous post I mentioned the possibility of using an iterator instead of
> an array, because an array loads all the results in memory, whereas an
> iterator loads a single document (or a fixed number of them) at every
> step. In the case where you call loadAll() there is a problem with memory.
>
>
>
>
> 2016-08-19 15:39 GMT+02:00, Glen Newton :
> > Making docid an int64 is a non-trivial undertaking, and this work needs
> to
> > be compared against the use cases and how compelling they are.
> >
> > That said, in the lifetime of most software projects a decision is made
> to
> > break backward compatibility to move the project forward.
> > When/if moving to int64 happens, it will be one of these moments. It is
> not
> > a Bad Thing (necessarily).  :-)
> >
> > And for use cases, if I am running a commercial JVM on a 64 core machine
> > with 3TB of ram (we have these running), int64 for >2^32 documents in a
> > single index should not be a problem...  :-)
> >
> > glen
> >
> > On Fri, Aug 19, 2016 at 4:43 AM, Adrien Grand  wrote:
> >
> >> On Fri, 19 Aug 2016 at 03:32, Trejkaz wrote:
> >>
> >> > But hang on:
> >> > * TopDocs#merge still returns a TopDocs.
> >> > * TopDocs still uses an array of ScoreDoc.
> >> > * ScoreDoc still uses an int doc ID.
> >> >
> >>
> >> This is why ScoreDoc has a `shardId` so that you can know which index a
> >> document comes from.
> >>
> >> I'm not saying we should not switch to long doc ids, but as outlined in
> >> some other responses it would be a challenging change.
> >>
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: docid is just a signed int32

2016-08-19 Thread Glen Newton
Making docid an int64 is a non-trivial undertaking, and this work needs to
be compared against the use cases and how compelling they are.

That said, in the lifetime of most software projects a decision is made to
break backward compatibility to move the project forward.
When/if moving to int64 happens, it will be one of these moments. It is not
a Bad Thing (necessarily).  :-)

And for use cases, if I am running a commercial JVM on a 64 core machine
with 3TB of ram (we have these running), int64 for >2^32 documents in a
single index should not be a problem...  :-)

glen

On Fri, Aug 19, 2016 at 4:43 AM, Adrien Grand  wrote:

> On Fri, 19 Aug 2016 at 03:32, Trejkaz wrote:
>
> > But hang on:
> > * TopDocs#merge still returns a TopDocs.
> > * TopDocs still uses an array of ScoreDoc.
> > * ScoreDoc still uses an int doc ID.
> >
>
> This is why ScoreDoc has a `shardId` so that you can know which index a
> document comes from.
>
> I'm not saying we should not switch to long doc ids, but as outlined in
> some other responses it would be a challenging change.
>


Re: docid is just a signed int32

2016-08-18 Thread Glen Newton
Or maybe it is time Lucene re-examined this limit.

There are use cases out there where >2^31 does make sense in a single index
(huge number of tiny docs).

Also, I think the underlying hardware and the JDK have advanced to make
this more defensible.

Constructively,
Glen


On Thu, Aug 18, 2016 at 9:55 AM, Adrien Grand  wrote:

> No, IndexWriter enforces that the number of documents cannot go over
> IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
> BaseCompositeReader computes the number of documents in a long variable and
> ensures it is less than 2^31, so you cannot have indexes that contain more
> than 2^31 documents.
>
> Larger collections should be written to multiple shards and use
> TopDocs.merge to merge results.
>
> On Thu, 18 Aug 2016 at 15:38, Cristian Lorenzetto
> <cristian.lorenze...@gmail.com> wrote:
>
> > docid is a signed int32 so it is not so big, but really docid seems to be
> > not an unmodifiable primary key but a temporary id for the view related to
> > a specific search.
> >
> > So a repository can contain more than 2^31 documents.
> >
> > Is my deduction correct? Is there a maximum size for a Lucene index?
> >
>
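
A sketch of the sharded pattern Adrien mentions above (TopDocs.merge plus
ScoreDoc's shard index), against the 6.x API; searcher0/searcher1 stand for
IndexSearchers over two separate shard indexes, and query is any Query:

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// search each shard independently, then merge the top hits
TopDocs[] perShard = new TopDocs[] {
    searcher0.search(query, 10),
    searcher1.search(query, 10),
};
TopDocs merged = TopDocs.merge(10, perShard);   // records ScoreDoc.shardIndex
for (ScoreDoc sd : merged.scoreDocs) {
  // the doc id stays a shard-local int; shardIndex says which searcher owns it
  IndexSearcher shard = (sd.shardIndex == 0) ? searcher0 : searcher1;
  Document doc = shard.doc(sd.doc);
}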


Re: Question about JoinUtil

2014-12-17 Thread Glen Newton
Hi Gregory,

Thanks for your reply. In reading it, I realized that one side of my
relational join wasn't that large, and I could bring it in as a couple
of fields to the main document without any penalty, so my need to join
two different document types then goes away.

Thanks!  :-)
Glen



On Tue, Dec 16, 2014 at 7:49 PM, Gregory Dearing  wrote:
> Glen,
>
> Lucene isn't relational at heart and may not be the right tool for
> what you're trying to accomplish. Note that JoinQuery doesn't join
> 'left' and 'right' answers; rather it transforms a 'left' answerset
> into a 'right' answerset.
>
> JoinQuery is able to perform this transformation with a single extra
> search, which wouldn't be possible if it accepted a 'toQuery'
> argument.
>
>
> That being said, here are some suggestions...
>
> 1. If all you really need is data from the 'right' set of answers (the
> joined TO set), then you can just add more queries to perform
> right-hand filtering.
>
>createJoinQuery(...) AND TermQuery("country", "CA*")
>
> Note that 'left.name' in your SQL example is no longer available.
>
> 2. If you really need to filter both sides, and you need to return
> data from both sides, it probably requires some programming.  In
> pseudo-code...
>
>   leftAnswerSet = searcher.search(fromQuery)
>
>   foreach leftAnswer in leftAnswerSet {
>     rightAnswers = searcher.search(leftAnswer AND TermQuery("country", "CA*"))
>     results.add([leftAnswer, rightAnswers])
>   }
>
> This is obviously not very efficient, but I think it probably
> represents what JoinQuery would look like if it allowed a 'toQuery'
> capability and returned data from both sides of the join.
>
> 3. If you can denormalize your data into hierarchies, then you could
> use index-time joining (BlockJoin) for better performance and easier
> collecting of your grouped data.  This is really limiting if your
> relationships are truly many to many.
>
> Hope that helps,
> Greg
>
>
> On Tue, Dec 16, 2014 at 10:46 AM, Glen Newton  wrote:
>> Anyone?
>>
>> On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton  wrote:
>>> Is there any reason JoinUtil (below) does not have a 'Query toQuery'
>>> available? I was wanting to filter on the 'to' side as well. I feel I
>>> am missing something here.
>>>
>>> To make sure this is not an XY problem, here is my use case:
>>>
>>> I have a many-to-many relationship. The left, join, and right 'table'
>>> objects are all indexed in the same lucene index, with a field 'type'
>>> to distinguish them.
>>>
>>> I need to do something like this:
>>> select left.name, right.country from left, join, right where
>>> left.type="fooType" and right.type="barType" and join.leftId=left.id
>>> and join.rightId=right.id and left.name="Fred*" and
>>> right.country="Ca*"
>>>
>>> Is JoinUtil the way to go?
>>> Or should I roll my own?
>>>Or am I indexing/using-Lucene incorrectly, thinking relational when
>>> a different way to index or query would be better in an idiomatic
>>> Lucene manner?  :-)
>>>
>>>
>>> Thanks,
>>> Glen
>>>
>>> https://lucene.apache.org/core/4_10_2/join/org/apache/lucene/search/join/JoinUtil.html
>>>
>>> public static Query createJoinQuery(String fromField,
>>>                                     boolean multipleValuesPerDocument,
>>>                                     String toField,
>>>                                     Query fromQuery,
>>>                                     IndexSearcher fromSearcher,
>>>                                     ScoreMode scoreMode)
>>>      throws IOException
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
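
For what it is worth, a sketch of Gregory's suggestion 1 against the 4.10
createJoinQuery signature quoted above, chaining two hops (left -> join ->
right) over the shared index and ANDing the 'to'-side filter on top. Field
names follow the SQL sketch in the thread, searcher is an IndexSearcher over
the whole index, and the lowercased wildcard terms assume an analyzed index:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

// 'from' side: left-type docs whose name matches fred*
BooleanQuery fromQuery = new BooleanQuery();
fromQuery.add(new TermQuery(new Term("type", "fooType")), Occur.MUST);
fromQuery.add(new WildcardQuery(new Term("name", "fred*")), Occur.MUST);

// hop 1: left docs -> join docs, via left.id = join.leftId
Query hop1 = JoinUtil.createJoinQuery("id", false, "leftId",
                                      fromQuery, searcher, ScoreMode.None);

// hop 2: join docs -> right docs, via join.rightId = right.id
Query hop2 = JoinUtil.createJoinQuery("rightId", false, "id",
                                      hop1, searcher, ScoreMode.None);

// the 'to'-side filter is simply ANDed on afterwards
BooleanQuery rightSide = new BooleanQuery();
rightSide.add(hop2, Occur.MUST);
rightSide.add(new TermQuery(new Term("type", "barType")), Occur.MUST);
rightSide.add(new WildcardQuery(new Term("country", "ca*")), Occur.MUST);
TopDocs hits = searcher.search(rightSide, 10);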



Re: Question about JoinUtil

2014-12-16 Thread Glen Newton
Anyone?

On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton  wrote:
> Is there any reason JoinUtil (below) does not have a 'Query toQuery'
> available? I was wanting to filter on the 'to' side as well. I feel I
> am missing something here.
>
> To make sure this is not an XY problem, here is my use case:
>
> I have a many-to-many relationship. The left, join, and right 'table'
> objects are all indexed in the same lucene index, with a field 'type'
> to distinguish them.
>
> I need to do something like this:
> select left.name, right.country from left, join, right where
> left.type="fooType" and right.type="barType" and join.leftId=left.id
> and join.rightId=right.id and left.name="Fred*" and
> right.country="Ca*"
>
> Is JoinUtil the way to go?
> Or should I roll my own?
>Or am I indexing/using-Lucene incorrectly, thinking relational when
> a different way to index or query would be better in an idiomatic
> Lucene manner?  :-)
>
>
> Thanks,
> Glen
>
> https://lucene.apache.org/core/4_10_2/join/org/apache/lucene/search/join/JoinUtil.html
>
> public static Query createJoinQuery(String fromField,
>                                     boolean multipleValuesPerDocument,
>                                     String toField,
>                                     Query fromQuery,
>                                     IndexSearcher fromSearcher,
>                                     ScoreMode scoreMode)
>      throws IOException

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Question about JoinUtil

2014-12-11 Thread Glen Newton
Is there any reason JoinUtil (below) does not have a 'Query toQuery'
available? I was wanting to filter on the 'to' side as well. I feel I
am missing something here.

To make sure this is not an XY problem, here is my use case:

I have a many-to-many relationship. The left, join, and right 'table'
objects are all indexed in the same lucene index, with a field 'type'
to distinguish them.

I need to do something like this:
select left.name, right.country from left, join, right where
left.type="fooType" and right.type="barType" and join.leftId=left.id
and join.rightId=right.id and left.name="Fred*" and
right.country="Ca*"

Is JoinUtil the way to go?
Or should I roll my own?
   Or am I indexing/using-Lucene incorrectly, thinking relational when
a different way to index or query would be better in an idiomatic
Lucene manner?  :-)


Thanks,
Glen

https://lucene.apache.org/core/4_10_2/join/org/apache/lucene/search/join/JoinUtil.html

public static Query createJoinQuery(String fromField,
                                    boolean multipleValuesPerDocument,
                                    String toField,
                                    Query fromQuery,
                                    IndexSearcher fromSearcher,
                                    ScoreMode scoreMode)
    throws IOException

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [ANN] word2vec for Lucene

2014-11-20 Thread Glen Newton
Hi Koji,

Semantic vectors is here: http://code.google.com/p/semanticvectors/

It is a project that has been around for a number of years and used by many
people (including me
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
).

If you could compare and contrast word2vec with Semantic Vectors, this
would allow many of us to understand where/when we might want to use
word2vec.

Thank-you,
Glen

On Thu, Nov 20, 2014 at 10:24 AM, Koji Sekiguchi  wrote:

> Hi Paul,
>
> I cannot compare it to SemanticVectors as I don't know SemanticVectors.
> But word vectors that are produced by word2vec have interesting properties.
>
> Here is the description of the original word2vec web site:
>
>
> https://code.google.com/p/word2vec/#Interesting_properties_of_the_word_vectors
> Interesting properties of the word vectors
> It was recently shown that the word vectors capture many linguistic
> regularities, for example vector
> operations vector('Paris') - vector('France') + vector('Italy') results in
> a vector that is very
> close to vector('Rome'), and vector('king') - vector('man') +
> vector('woman') is close to
> vector('queen')
>
> Thanks,
>
> Koji
>
>
> (2014/11/20 20:01), Paul Libbrecht wrote:
> > Hello Koji,
> >
> > how would you compare that to SemanticVectors?
> >
> > paul
> >
> > On 20 nov. 2014, at 10:10, Koji Sekiguchi  wrote:
> >
> >> Hello,
> >>
> >> It's my pleasure to share that I have an interesting tool "word2vec for
> Lucene"
> >> available at https://github.com/kojisekig/word2vec-lucene .
> >>
> >> As you can imagine, you can use "word2vec for Lucene" to extract word
> vectors from Lucene index.
> >>
> >> Thank you,
> >>
> >> Koji
> >> --
> >>
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
>
> --
>
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: IndexWriter croaks on large file

2014-02-14 Thread Glen Newton
You should consider making each _line_ of the log file a (Lucene)
document (assuming it is a log-per-line log file)

-Glen
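
A minimal sketch of that line-per-document approach (4.x field API; iw,
fileid and pathname as in John's code quoted below, and the "lineno" field is
my addition):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

BufferedReader br = new BufferedReader(new FileReader(pathname));
try {
  String line;
  long lineNo = 0;
  while ((line = br.readLine()) != null) {
    Document doc = new Document();
    doc.add(new StoredField("fileid", fileid));
    doc.add(new StoredField("pathname", pathname));
    doc.add(new LongField("lineno", lineNo++, Field.Store.YES));
    doc.add(new TextField("content", line, Field.Store.NO));
    iw.addDocument(doc);  // no single field ever approaches the 2GB offset limit
  }
} finally {
  br.close();
}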

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere  wrote:
> I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At
> any rate, I don't have control over the size of the documents that go into
> my database. Sometimes my customer's log files end up really big. I'm
> willing to have huge indexes for these things.
>
> Wouldn't just changing from int to long for the offsets solve the problem?
> I'm sure it would probably have to be changed in a lot of places, but why
> impose such a limitation? Especially since it's using an InputStream and
> only dealing with a block of data at a time.
>
> I'll take a look at your suggestion.
>
> Thanks,
> John
>
>
> On 2/14/14 3:20 PM, Michael McCandless wrote:
>>
>> Hmm, why are you indexing such immense documents?
>>
>> In 3.x Lucene never sanity checked the offsets, so we would silently
>> index negative (int overflow'd) offsets into e.g. term vectors.
>>
>> But in 4.x, we now detect this and throw the exception you're seeing,
>> because it can lead to index corruption when you index the offsets
>> into the postings.
>>
>> If you really must index such enormous documents, maybe you could
>> create a custom tokenizer  (derived from StandardTokenizer) that
>> "fixes" the offset before setting them?  Or maybe just doesn't even
>> set them.
>>
>> Note that position can also overflow, if your documents get too large.
>>
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 14, 2014 at 1:36 PM, John Cecere 
>> wrote:
>>>
>>> I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a
>>> file >
>>> 2GB in size, it dies with the following exception:
>>>
>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>> endOffset must be >= startOffset,
>>> startOffset=-2147483648,endOffset=-2147483647
>>>
>>> Essentially, I'm doing this:
>>>
>>> Directory directory = new MMapDirectory(indexPath);
>>> Analyzer analyzer = new StandardAnalyzer();
>>> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
>>> analyzer);
>>> IndexWriter iw = new IndexWriter(directory, iwc);
>>>
>>> InputStream is = ;
>>> InputStreamReader reader = new InputStreamReader(is);
>>>
>>> Document doc = new Document();
>>> doc.add(new StoredField("fileid", fileid));
>>> doc.add(new StoredField("pathname", pathname));
>>> doc.add(new TextField("content", reader));
>>>
>>> iw.addDocument(doc);
>>>
>>> It's the IndexWriter addDocument method that throws the exception. In
>>> looking at the Lucene source code, it appears that the offsets being used
>>> internally are int, which makes it somewhat obvious why this is
>>> happening.
>>>
>>> This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly
>>> capable of handling a file over 2GB in this manner. What has changed and
>>> how
>>> do I get around this ? Is Lucene no longer capable of handling files this
>>> large, or is there some other way I should be doing this ?
>>>
>>> Here's the full stack trace sans my code:
>>>
>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>> endOffset must be >= startOffset,
>>> startOffset=-2147483648,endOffset=-2147483647
>>>  at
>>>
>>> org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
>>>  at
>>>
>>> org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
>>>  at
>>>
>>> org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
>>>  at
>>>
>>> org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>>>  at
>>>
>>> org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>>>  at
>>>
>>> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>>>  at
>>>
>>> org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>>>  at
>>>
>>> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
>>>  at
>>>
>>> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
>>>  at
>>> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
>>>  at
>>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
>>>  at
>>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)
>>>
>>> Thanks,
>>> John
>>>
>>> --
>>> John Cecere
>>> Principal Engineer - Oracle Corporation
>>> 732-987-4317 / john.cec...@oracle.com
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>
>> 
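
For completeness, a TokenFilter variant of Mike's offset-"fixing" suggestion
above -- a sketch only, and arguably such documents should not be indexed
whole at all. It clamps overflowed offsets so the IndexWriter check is not
tripped (class name is mine):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class ClampOffsetsFilter extends TokenFilter {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public ClampOffsetsFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    int start = offsetAtt.startOffset();
    int end = offsetAtt.endOffset();
    if (start < 0 || end < start) {
      // int overflow past ~2GB of input: pin both offsets to a legal value
      offsetAtt.setOffset(Integer.MAX_VALUE, Integer.MAX_VALUE);
    }
    return true;
  }
}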

Re: Index-time term expansion

2013-05-03 Thread Glen Newton
Thanks  :-)

On Fri, May 3, 2013 at 2:31 PM, Alan Woodward  wrote:
> Hi Glen,
>
> You want the SynonymFilter: 
> http://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html
>
> Alan Woodward
> www.flax.co.uk
>
>
> On 3 May 2013, at 19:14, Glen Newton wrote:
>
>> Hello,
>>
>> I know I've seen it go by on this list and elsewhere, but cannot seem
>> to find it: can someone point me to the best way to do term expansions
>> at indexing time.
>>
>> That is, when the sentence is: "This foo is in my way"
>> And somewhere I have: foo=bar|yak
>> Lucene indexes something like:
>>  "This foo|bar|yak is in my way"
>>
>> where foo, bar and yak all share the same offset.
>> Oh, interested in versions 3.6 and/or 4.2
>>
>> Thanks,
>> Glen
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
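
A minimal sketch of the SynonymFilter usage Alan points to (4.2-era analysis
API; the whitespace tokenizer and the foo -> bar|yak mapping mirror the
example in the question):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

static Analyzer expandingAnalyzer() throws IOException {
  SynonymMap.Builder builder = new SynonymMap.Builder(true);    // dedup
  builder.add(new CharsRef("foo"), new CharsRef("bar"), true);  // keep original
  builder.add(new CharsRef("foo"), new CharsRef("yak"), true);
  final SynonymMap map = builder.build();

  return new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
      Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_42, reader);
      // bar and yak are injected at the same position as foo (posIncr == 0)
      return new TokenStreamComponents(source, new SynonymFilter(source, map, true));
    }
  };
}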



Index-time term expansion

2013-05-03 Thread Glen Newton
Hello,

I know I've seen it go by on this list and elsewhere, but cannot seem
to find it: can someone point me to the best way to do term expansions
at indexing time.

That is, when the sentence is: "This foo is in my way"
And somewhere I have: foo=bar|yak
Lucene indexes something like:
  "This foo|bar|yak is in my way"

where foo, bar and yak all share the same offset.
Oh, interested in versions 3.6 and/or 4.2

Thanks,
Glen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread Glen Newton
I am in the process of upgrading LuSql from 2.x to 4.x, and I am first
going to 3.6, as the jump straight to 4.x was too big.
I would suggest this to you; I think it is less work.
Of course I am also able to offer LuSql to 3.6 users, so this is
slightly different from your case.

-Glen

On Wed, Jan 9, 2013 at 4:58 PM, saisantoshi  wrote:
> Are there any best practices that we can follow? We want to get to the latest
> version and are thinking if we can directly go from 2.4.0 to 4.x (as opposed
> to 2.x -> 3.x and 3.x -> 4.x), so that it will not only save time but also the
> testing cycle at each migration hop.
>
> Are there any limitations in directly upgrading from 2.x - 4.x? Is this
> allowed?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Upgrade-Lucene-to-latest-version-4-0-from-2-4-0-tp4031956p4032038.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Cool! Sounds great!  :-)

Any pointers to a (Lucene) example that attaches a payload to a
start..end span that is more than one token?

thanks,
-Glen

On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog  wrote:
> I should not have added that note. The Opennlp patch gives a concrete
> example of adding an annotation to text.
>
>
> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>
>> It is not clear this is exactly what is needed/being discussed.
>>
>>  From the issue:
>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>> the same position."
>>
>> This adds it to a token, not a span. 'same position' does not suggest
>> it also records the end position.
>>
>> -Glen
>>
>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog  wrote:
>>>
>>> Parts-of-speech is available now, in the indexer.
>>>
>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>> Apache
>>> project for natural-language processing.
>>>
>>> Some parts are in Solr that could be in Lucene.
>>>
>>> https://issues.apache.org/jira/browse/lucene-2899
>>>
>>>
>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>
>>>>>> Is there any (preliminary) code checked in somewhere that I can look
>>>>>> at,
>>>>>> that would help me understand the practical issues that would need to
>>>>>> be
>>>>>> addressed?
>>>>>
>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>> needing to record in the postings and access at search time?
>>>>
>>>> For example:
>>>>- part of speech of a token.
>>>>- syntactic parse subtree (over a span).
>>>>- semantically normalized phrase (to canonical text or ontological
>>>> code).
>>>>- semantic group (of a span).
>>>>- coreference link.
>>>>
>>>> stephen
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
It is not clear this is exactly what is needed/being discussed.

From the issue:
"We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position."

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog  wrote:
> Parts-of-speech is available now, in the indexer.
>
> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
> project for natural-language processing.
>
> Some parts are in Solr that could be in Lucene.
>
> https://issues.apache.org/jira/browse/lucene-2899
>
>
> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>   - part of speech of a token.
>>   - syntactic parse subtree (over a span).
>>   - semantically normalized phrase (to canonical text or ontological
>> code).
>>   - semantic group (of a span).
>>   - coreference link.
>>
>> stephen
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
>Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.

If this could be fixed (i.e. indexing the _end_ of a span) I think all
the things that I want to do, and the things that can now be done in
GATE very easily, would be possible using Mike's suggested method.


-Glen

On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
 wrote:
> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
>  wrote:
>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
>
> So for example part-of-speech is a per-Token-position attribute.
>
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
>
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.
>
> For the span-like attributes (eg a syntactic parse, semantically
> normalized phrase) I think you'd need to do something like
> SynonymFilter in your analysis, i.e. insert new tokens at the position
> where the span started.  Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
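
A sketch of the payload route Mike describes above (4.0 analysis API;
PosPayloadFilter and posTagFor are my names, with posTagFor standing in for a
real tagger such as the OpenNLP one discussed elsewhere in this thread):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class PosPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public PosPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    // one byte per token is enough for a small part-of-speech tag set
    payloadAtt.setPayload(new BytesRef(new byte[] { posTagFor(termAtt.toString()) }));
    return true;
  }

  private byte posTagFor(String term) {
    return 0;  // placeholder for a real lookup
  }
}

At search time a PayloadTermQuery (with a PayloadFunction) can decode those
bytes back into scoring, as Mike notes.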



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Glen Newton
+10

These are the kind of things you can do in GATE[1] using annotations[2].
A VERY useful feature.

-Glen

[1]http://gate.ac.uk
[2]http://gate.ac.uk/wiki/jape-repository/annotations.html

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
 wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.
>
> stephen
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene in Corpus Linguistics

2012-09-26 Thread Glen Newton
Yes, very interested.
--> Quick scan: very cool work!  +10   :-)

Thanks,
Glen Newton

On Wed, Sep 26, 2012 at 9:59 AM, Carsten Schnober
 wrote:
> Hi,
> in case someone is interested in an application of the Lucene indexing
> engine in the field of corpus linguistics rather than information retrieval:
> we have worked on that subject for some time and have recently published a
> conference paper about it:
> http://korap.ids-mannheim.de/2012/09/konvens-proceedings-online/
>
> Central issues addressed in this work have been to externally produced and
> concurrent tokenizations as well as multiple linguistic annotations on
> different levels.
>
> Best,
> Carsten
>
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance of storing data in Lucene vs other (No)SQL Databases

2012-05-18 Thread Glen Newton
Storing content in large indexes can significantly add to index time.

The model of indexing fields only in Lucene and storing just a key,
and then storing the content in some other container (DBMS, NoSql,
etc) with the key as lookup is almost a necessity for this use case
unless you have a completely static index (create once + never add
to).

Thanks,
Glen
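
A minimal sketch of that key-only pattern (3.x field API; the field names and
externalKey are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// the only stored value is the external lookup key
doc.add(new Field("key", externalKey, Field.Store.YES, Field.Index.NOT_ANALYZED));
// the content itself is indexed for search but not stored in Lucene
doc.add(new Field("content", fullText, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);

// at search time, only the key comes back from Lucene; the content is then
// fetched from the other container (DBMS, NoSQL, ...) by that key
Document hit = searcher.doc(topDocs.scoreDocs[0].doc);
String key = hit.get("key");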

On Fri, May 18, 2012 at 10:44 AM, Konstantyn Smirnov
 wrote:
> Hi all,
>
> apologies, if this question was already asked before.
>
> If I need to store a lot of data (say, millions of documents), what would
> perform better (in terms of reads/writes/scalability etc.): Lucene with
> stored fields (Field.Store.YES) or another NoSql DB like Mongo or Couch?
>
> Does it make sense to index and store the data separately?
>
> TIA
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-of-storing-data-in-Lucene-vs-other-No-SQL-Databases-tp3984704.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Customizing indexing of large files

2012-02-27 Thread Glen Newton
Hi,

Understood.
Write a custom FileReader that filters out the text you do not want.
This will do it streaming.

Glen
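
A sketch of such a streaming filter for the DATA_BEGIN/DATA_END layout quoted
below (plain java.io; the class name is mine). It can be handed to the
Field(String name, Reader reader) constructor in place of the plain
FileReader:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public final class SkipDataSectionsReader extends Reader {
  private final BufferedReader in;
  private String current = "";
  private int pos = 0;
  private boolean skipping = false;

  public SkipDataSectionsReader(Reader underlying) {
    this.in = new BufferedReader(underlying);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    while (pos >= current.length()) {             // need the next kept line
      String line = in.readLine();
      if (line == null) return -1;                // end of file
      if (line.startsWith("DATA_BEGIN")) { skipping = true;  continue; }
      if (line.startsWith("DATA_END"))   { skipping = false; continue; }
      if (skipping) continue;                     // drop the column data
      current = line + "\n";
      pos = 0;
    }
    int n = Math.min(len, current.length() - pos);
    current.getChars(pos, pos + n, cbuf, off);
    pos += n;
    return n;
  }

  @Override
  public void close() throws IOException {
    in.close();
  }
}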

On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
 wrote:
> Hi,
>
> Description is multiline; in addition there is other text also. So,
> essentially what I need is to jump to DATA_END as soon as I hit DATA_BEGIN.
>
> I am creating the field using the constructor Field(String name, Reader 
> reader) and using StandardAnalyser. Right now I am using FileReader which is 
> causing all the text to be indexed/tokenized.
>
> Amount of text I am interested in is also pretty large, description is just 
> one such example. So, I really want some stream based implementation to avoid 
> keeping large amount of text in memory. May be a custom TokenStream, but I 
> don't know what to implement in tokenstream. The only abstract method is 
> incrementToken, I have no idea what to do in it.
>
> Regards,
>
> Prakash Bande
> Director - Hyperworks Enterprise Software
> Altair Eng. Inc.
> Troy MI
> Ph: 248-614-2400 ext 489
> Cell: 248-404-0292
>
> -Original Message-
> From: Glen Newton [mailto:glen.new...@gmail.com]
> Sent: Monday, February 27, 2012 12:05 PM
> To: java-user@lucene.apache.org
> Subject: Re: Customizing indexing of large files
>
> I'd suggest writing a perl script or
> insert-favourite-scripting-language-here script to pre-filter this
> content out of the files before it gets to Lucene/Solr
> Or you could just grep for "Data" and "Description" (or is
> 'Description' multi-line)?
>
> -Glen Newton
>
> On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
>  wrote:
>> Hi,
>>
>> I want to customize the indexing of some specific kind of files I have. I am 
>> using 2.9.3 but upgrading is possible.
>> This is how my file's data looks
>>
>> *
>> Data for 2010
>> Description: This section has a general description of the data.
>> DATA_BEGIN
>> Month       P1          P2          P3
>> 01          3243.433    43534.324   45345.2443
>> 02          3242.324    234234.24   323.2343
>> ...
>> ...
>> ...
>> ...
>> DATA_END
>> Data for 2011
>> Description: This section has a general description of the data.
>> DATA_BEGIN
>> Month       P1          P2          P3
>> 01          3243.433    43534.324   45345.2443
>> 02          3242.324    234234.24   323.2343
>> ...
>> ...
>> ...
>> ...
>> DATA_END
>> *
>>
>> I would like to use a StandardAnalyser, but do not want to index the data of 
>> the columns, i.e. skip all those numbers. Basically, as soon as I hit the 
>> keyword DATA_BEGIN, I want to jump to DATA_END.
>> So, what is the best approach? Using a custom Reader, custom tokenizer or 
>> some other mechanism.
>> Regards,
>>
>> Prakash Bande
>> Altair Eng. Inc.
>> Troy MI
>> Ph: 248-614-2400 ext 489
>> Cell: 248-404-0292
>>
>
>
>
> --
> -
> http://zzzoot.blogspot.com/
> -
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Customizing indexing of large files

2012-02-27 Thread Glen Newton
I'd suggest writing a perl script or
insert-favourite-scripting-language-here script to pre-filter this
content out of the files before it gets to Lucene/Solr
Or you could just grep for "Data" and "Description" (or is
'Description' multi-line)?

-Glen Newton

On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
 wrote:
> Hi,
>
> I want to customize the indexing of some specific kind of files I have. I am 
> using 2.9.3 but upgrading is possible.
> This is how my file's data looks
>
> *
> Data for 2010
> Description: This section has a general description of the data.
> DATA_BEGIN
> Month       P1          P2          P3
> 01          3243.433    43534.324   45345.2443
> 02          3242.324    234234.24   323.2343
> ...
> ...
> ...
> ...
> DATA_END
> Data for 2011
> Description: This section has a general description of the data.
> DATA_BEGIN
> Month       P1          P2          P3
> 01          3243.433    43534.324   45345.2443
> 02          3242.324    234234.24   323.2343
> ...
> ...
> ...
> ...
> DATA_END
> *
>
> I would like to use a StandardAnalyser, but do not want to index the data of 
> the columns, i.e. skip all those numbers. Basically, as soon as I hit the 
> keyword DATA_BEGIN, I want to jump to DATA_END.
> So, what is the best approach? Using a custom Reader, custom tokenizer or 
> some other mechanism.
> Regards,
>
> Prakash Bande
> Altair Eng. Inc.
> Troy MI
> Ph: 248-614-2400 ext 489
> Cell: 248-404-0292
>



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can I detect incorrect language selection after creating an index?

2012-02-27 Thread Glen Newton
Do the check _before_ indexing.
Use https://code.google.com/p/language-detection/  to verify the
language of the text document before you put it in the index.

-Glen Newton
http://zzzoot.blogspot.com/
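
A sketch of that pre-index check, against the language-detection API as I
understand it (check the project docs for the exact calls; these throw
LangDetectException, and "profiles" is the profile directory shipped with the
library):

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;

DetectorFactory.loadProfile("profiles");   // once, at startup
Detector detector = DetectorFactory.create();
detector.append(documentText);
String lang = detector.detect();           // e.g. "en", "es", "ru", "ar", "zh-cn"
if (!lang.equals(expectedLang)) {
  // flag the document before it is indexed with the wrong analyzer
}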

On Mon, Feb 27, 2012 at 10:53 AM, Ilya Zavorin  wrote:
> Suppose I have a bunch of text documents in language X but I index ithem 
> using an analyzer for language Y. Once the index is created, is it possible 
> to perform some sort of simple "sanity" check to see if the original language 
> selection was wrong? I presume I can try searching for some common word in 
> language Y, but I am not sure how reliable this would be. On the other hand, 
> if languages are from the same group, say X and Y are English and Spanish, I 
> should expect that this sanity check would produce a false match. However, I 
> would be happy if it worked reliably enough for languages using different 
> scripts, e.g. Latin vs Cyrillic vs Arabic vs Chinese etc.
>
>
> Thanks much
>
>
>
> Ilya Zavorin



-- 
-
http://zzzoot.blogspot.com/
-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Castle for Lucene/Solr?

2011-09-04 Thread Glen Newton
"Caste" --> Castle

https://bitbucket.org/acunu
http://support.acunu.com/entries/20216797-castle-build-instructions

It looks very promising.
It is a kernel module and I'm not sure it can run in user space, which
I'd prefer.

-Glen Newton

On Sat, Sep 3, 2011 at 9:21 PM, Otis Gospodnetic
 wrote:
> Hello,
>
> I saw mentions of something called "Caste" a while back, but only now looked 
> at what it is, and it sounds like something that's potentially 
> interesting/useful (performance-wise) for Lucene/Solr.
>
> See http://twitter.com/#!/otisg/status/109768673467699200
>
>
> Has anyone tried it with Lucene/Solr by any chance?
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What kind of System Resources are required to index 625 million row table...???

2011-08-15 Thread Glen Newton
"…start of garbage collection until the heap is full. Therefore, the first
time that the GC runs, the process can take longer. Also, the heap is more
likely to be fragmented and require a heap compaction. You are advised to
start your application with the minimum heap size that your application
requires. When the GC starts up, it will run frequently and efficiently,
because the heap is small." - p43

AIX allows different malloc policies to be used in the underlying
system calls. Consider using the WATSON (!) malloc policy. p.134,136
and 
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm

Finally (or before doing all of this! :-)  ), do some profiling, both
inside of Java, and of the AIX native heap using svmon (see "Native
Heap Exhaustion, p.135).

-Glen Newton
http://zzzoot.blogspot.com/
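
As a back-of-envelope illustration of why the 4 GB heap below runs out (my
estimate, not measured): in Lucene 3.x/4.x, sorting populates a FieldCache
with one entry per document, so sorting 625 million docs on the 8-byte
integer column alone needs roughly 625e6 x 8 bytes ≈ 5 GB for that array --
already more than the whole 4 GB heap, before the index's own structures are
counted.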




On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony  wrote:
> Thanks for the quick response.
>
> As to your questions:
>
>  Can you talk a bit more about what the search part of this is?
>  What are you hoping to get that you don't already have by adding in search?
>  Choices for fields can have impact on performance, memory, etc.
>
We currently have an "exact match" search facility, which uses SQL.
> We would like to add "text search" capabilities...
> ...initially, having the ability to search the 229 character field for a 
> given word, or phrase, instead of an exact match.
> A future enhancement would be to add a synonym list.
> As to "field choice", yes, it is possible that all fields would be involved 
> in the "search"...
> ...in the interest of full disclosure, the fields are:
>   - corp  - corporation that owns the document
>   - type  - document type
>   - tmst  - creation timestamp
>   - xmlid - xml namespace ID
>   - tag   - meta data qualifier
>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>
>
>
>  Was this single threaded or multi-threaded?  How big was the resulting index?
>
> The search would be a threaded application.
>
>  How big was the resulting index?
>
> The index that was built was 70 GB in size.
>
>  Have you tried increasing the heap size?
>
> We have increased the up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.
>
> Thanks,
> -tony
>
>
> -Original Message-
> From: Grant Ingersoll [mailto:gsing...@apache.org]
> Sent: Monday, August 15, 2011 2:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million 
> row table...???
>
>
> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>
>> We are examining the possibility of using Lucene to provide Text Search
>> capabilities for a 625 million row DB2 table.
>>
>> The table has 6 fields, all which must be stored in the Lucene Index.
>> The largest column is 229 characters, the others are 8, 12, 30, and 1
>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long 
>> long).
>
> Can you talk a bit more about what the search part of this is?  What are you 
> hoping to get that you don't already have by adding in search?  Choices for 
> fields can have impact on performance, memory, etc.
>
>>
>> We have written a test app on a development system (AIX 6.1),
>> and have successfully Indexed 625 million rows...
>> ...which took about 22 hours.
>
> Was this single threaded or multi-threaded?  How big was the resulting index?
>
>
>>
>> When writing the "search" application... we find a simple version works, 
>> however,
>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>
>
> How many terms do you have in your index and in the field you are 
> sorting/filtering on?  Have you tried increasing the heap size?
>
>
>> Before continuing our research, we'd like to find a way to determine
>> what system resources are required to run this kind of application...???
>
> I don't know that there is a straight forward answer here with the 
> information you've presented.  It can depend on how you intend to 
> search/sort/filter/facet, etc.  General rule of thumb is that when you get 
> over 100M documents, you need to shard, but you also have pretty small 
> documents so your mileage may vary.   I've seen indexes in your range on a 
> single machine (for small docs) with low search volumes, but that isn't to 
> say it will work for you without more insight into your documents, etc.
>
>> In

Re: Index one huge text file

2011-07-22 Thread Glen Newton
So to use Lucene-speak, each sentence is a document.

I don't know how you are indexing and what code you are using (and
what hardware, etc.), but if you are not already, you should consider
multi-threading the indexing, which should give you a significant
indexing performance boost.

-Glen
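
A minimal sketch of the sentence-per-document indexing (3.x-era API, assuming
one sentence per line; the field names and the id for pairing the two
languages are mine):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;

BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream(corpusFile), "UTF-8"));
String sentence;
int id = 0;
while ((sentence = br.readLine()) != null) {
  Document doc = new Document();
  doc.add(new Field("lang", "en", Field.Store.YES, Field.Index.NOT_ANALYZED));
  doc.add(new NumericField("sentenceId", Field.Store.YES, true).setIntValue(id++));
  doc.add(new Field("text", sentence, Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);   // one Document per sentence; no file splitting
}
br.close();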


On Fri, Jul 22, 2011 at 11:04 AM, starz10de  wrote:
> I am interested in searching at the sentence level.
> It is a parallel corpus: each sentence in the first language is
> equivalent to a sentence in the second language. I want to index each
> sentence and have some id for each sentence so that when I retrieve it I can
> easily retrieve its equivalent in the second language.
>
> This I did by splitting the file and treating each sentence as a text file.
> However, this really takes a long time to do for many huge text files.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Index-one-huge-text-file-tp3191605p3191628.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index one huge text file

2011-07-22 Thread Glen Newton
Could you elaborate what you want to do with the index of large
documents? Do you want to search at the document or sentence level?
This can drive how to index this content.

-Glen

On Fri, Jul 22, 2011 at 10:52 AM, starz10de  wrote:
> Hi,
>
> I have one text file that contains 60 000 sentences. Is there a possibility
> to index this file sentence by sentence where each sentence is treated as
> one document? What I do now is splitting the huge text files into 60 000
> sentences and then indexing them. This work is not easy because I have a few
> huge documents and it takes a long time to split each file.
>
> Thanks  in advance
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Index-one-huge-text-file-tp3191605p3191605.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Architecture Site (Prototype)

2011-07-07 Thread Glen Newton
gmail interprets the closing asterisk as part of the URL, for all
three URLs --> 404s
You might want to add a space before the '*'...

-glen

On Thu, Jul 7, 2011 at 2:17 PM, Abhishek Rakshit  wrote:
> Hey folks,
>
> We received great feedback on the Lucene Architecture site that we have been
> building. Thanks for all the awesome responses.
>
> One of the larger pieces of feedback was on making the architectural
> information aligned around concepts. We have created a prototype of the site
> based on this and other feedback. It will be great to hear your thoughts.
>
> Current Site:
> *http://www.codemaps.org/s/Lucene_Core*
>
> Prototype:
> *http://www.architexa.com/codemapsDev/lucene/*
> *http://www.architexa.com/codemapsDev/lucene/analyzer.htm*
>
> What do you guys think? I am really looking forward to hearing what parts of
> the current site and the prototype you like.
>
> We want to keep refining this site till it is a great resource for the
> Lucene community and your advice is very much appreciated.
>
> Cheers,
> Abhishek Rakshit
> --
> Lead Engineer - User Experience,
> Architexa - www.architexa.com
> abhis...@architexa.com
>






Re: Lucene on Multi-Processor/Core machines

2011-01-25 Thread Glen Newton
This is some older stuff I have done, likely still fairly relevant. I
would say that today things are _better_ than these results for Lucene
multithreading / multicore. :-)
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html

Glen Newton


On Tue, Jan 25, 2011 at 11:31 AM, Siraj Haider  wrote:
> Hello there,
> I was looking for best practices for indexing/searching on a
> multi-processor/core machine but could not find any specific material on
> that.  Do you think it is a good idea to create a guide/how-to for that
> purpose?  It would be very helpful for many people in today's world, where
> almost all machines come with multiple cores and, optionally,
> multiple processors.
>
> regards
> -siraj
>
>
>






Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Glen Newton
Where do you get your Lucene/Solr downloads from?

[x] ASF Mirrors (linked in our release announcements or via the Lucene website)

[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)

[] I/we build them from source via an SVN/Git checkout.


-Glen Newton





Re: Dataimport performance

2010-12-16 Thread Glen Newton
Hi,

The LuSqlV2 beta comes out in the next few weeks, and is designed to
address this issue (among others).

LuSql original 
(http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
now moved to: https://code.google.com/p/lusql/) is a JDBC-->Lucene
high performance loader.

You may have seen my posts on this list suggesting LuSql as a high
performance alternative to DIH, for a subset of use cases.

LuSqlV2 has evolved into a full extract-transform-load (ETL) high
performance engine, focusing on many of the issues of interest to the
Lucene/SOLR community.
It has a pipelined, pluggable, multithreaded architecture.
It is basically: pluggable source --> 0 or more pluggable filters -->
pluggable sink
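
As a rough illustration of that shape, here is a sketch in Java (these
interfaces are invented for the example; they are not LuSql's actual API):

import java.util.List;

// Hypothetical pipeline interfaces; the names are illustrative only.
interface Source<T> { T next() throws Exception; }          // null = exhausted
interface Filter<T> { T apply(T record) throws Exception; } // transform/enrich
interface Sink<T>   { void write(T record) throws Exception;
                      void close() throws Exception; }

class Pipeline<T> {
  private final Source<T> source;
  private final List<Filter<T>> filters;
  private final Sink<T> sink;

  Pipeline(Source<T> source, List<Filter<T>> filters, Sink<T> sink) {
    this.source = source;
    this.filters = filters;
    this.sink = sink;
  }

  // Single-threaded driver; a multithreaded version would put a
  // BlockingQueue between the source and a pool of filter/sink workers.
  void run() throws Exception {
    T record;
    while ((record = source.next()) != null) {
      for (Filter<T> f : filters) {
        record = f.apply(record);
      }
      sink.write(record);
    }
    sink.close();
  }
}

Swapping JDBC for Lucene, SOLR, or BDB then just means plugging in a
different Source or Sink implementation.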

Source plugins implemented:
- JDBC, Lucene, SOLR (SolrJ), BDB, CSV, RMI, Java Serialization
Sink plugins implemented:
- JDBC, Lucene, SOLR (SolrJ), BDB, XML, RMI, Java Serialization, Tee,
NullSink [I am working on a memcached Sink]
A number of different filters are implemented (e.g. get a PDF file from the
filesystem based on an SQL field, convert it and get the text, etc.), including:
BDBJoinFilter, JDBCJoinFilter

--

This particular problem is one of the unit tests I have: given a
simple database of:
1- table Name
2- table City
3- table nameCityJoin
4- table Job
5- table nameJobJoin

run a JDBC-->BDB LuSql instance, one each for City+nameCityJoin and
Job+nameJobJoin; then run a JDBC-->SolrJ instance on table Name, adding 2
BDBJoinFilters, each of which takes the BDB generated earlier and does the
join (you just tell the filters which field from the JDBC-generated record to
use against the BDB key).

So your use case would be a larger example of this.

Also of interest:
- Java RMI (Remote Method Invocation): both an RMISink(Server) and
RMISource(Client) are implemented. This means you can set up N
machines which are doing something, and have one or more clients (on
their own machines) that are pulling this data and doing something
with it. For example, JDBC-->PDFToTextFilter-->RMI (converting PDF
files to text based on the contents of a SQL database, with text files
in the file system): basically doing some heavy lifting, and then
start up an RMI-->SolrJ (or Lucene) which is a client to the N PDF
converting machines, doing only the Lucene/SOLR indexing. The client
does a pull when it needs more data. You can have N servers x M
clients! Oh, string fields of length > 1024 are automatically gzipped by
the RMI Sink(Server), to reduce network traffic (at the cost of CPU;
selectable). I am looking into RMI alternatives, like Thrift and ProtoBuf,
for my next Sources/Sinks to implement. Another example is the reverse
use case: when the indexing is more expensive than getting the data.
Example: One JDBC-->RMISink(Server) instance, N
RMISource(Client)-->Lucene instances; this allows multiple Lucenes to
be fed from a single JDBC source, across machines.

- TeeSink: the Tee sink hides N sinks, so you can split the pipeline
into multiple Sinks. I've used it to send the same content to Lucene
as well as BDB in one fell swoop. Can you say index and content store
in one step?

I am working on cleaning up the code, writing docs (I made the mistake
of making great docs for LuSqlV1, so I have work to do...!), and
making a couple more tests.

I will announce the beta on this and the Lucene list.

If you have any questions, please contact me.

Thanks,
Glen Newton
http://zzzoot.blogspot.com

--> Old LuSql benchmarks:
http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html

On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James  wrote:
> We have ~50 long-running SQL queries that need to be joined and denormalized. 
>  Not all of the queries are to the same db, and some data comes from 
> fixed-width data feeds.  Our current search engine (that we are converting to 
> SOLR) has a fast disk-caching mechanism that lets you cache all of these data 
> sources and then it will join them locally prior to indexing.
>
> I'm in the process of developing something similar for DIH that uses the 
> Berkeley DB to do the same thing.  It's good enough that I can do nightly full 
> re-indexes of all our data while developing the front-end, but it is still 
> very rough.  Possibly I would like to get this refined enough to eventually 
> submit as a jira ticket / patch as it seems this is a somewhat common problem 
> that needs solving.
>
> Even with our current search engine, the join & denormalize step is always 
> the longest-running part of the process.  However, I have it running fairly 
> fast by partitioning the data by a modulus of the primary key and then 
> running several jobs in parallel.  The trick is not to get I/O bound.  Things 
> run fast if you can set it up to maximize CPU.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: Ephraim Ofir [mailto:ephra...@icq.com]
> Sent

IndexTank technology...

2010-11-11 Thread Glen Newton
Does anyone know what technology they are using: http://www.indextank.com/
Is it Lucene under the hood?

Thanks, and apologies for cross-posting.
-Glen

http://zzzoot.blogspot.com




Re: lucene usage on TREC data

2010-08-14 Thread Glen Newton
Lucene has been used - usually as a starting base that has been
modified for specific tasks - by a number of IR researchers for
various TREC challenges. Here are some (there are many more):

IBM Haifa:
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

Logistic Regression Merging of Amberfish and Lucene Multisearch Results
http://trec.nist.gov/pubs/trec14/papers/ualaska.tera.pdf

IR 2009 Term Project - Modern Web Search (TREC WT10g)
http://code.google.com/p/lucene-web-search-and-trec-wt10g/

Phrasal Queries with LingPipe and Lucene: Ad Hoc Genomics Text Retrieval
http://comminfo.rutgers.edu/~muresan/IR/TREC/Proceedings/t13_proceedings/papers/alias-i.geo.pdf

Evaluation of the Default Similarity Function in Lucene
http://www.ece.udel.edu/~hfang/lucene/Lucene_exp.pdf

Lucene for n-grams using the ClueWeb collection
http://trec.nist.gov/pubs/trec18/papers/arsc.WEB.pdf

Expanding Queries Using Multiple Resources
http://staff.science.uva.nl/~mdr/Publications/Files/trec2006-proceedings-genomics.pdf

-Glen Newton
http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
http://zzzoot.blogspot.com/2008/11/software-announcement-lusql-database-to.html
http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html






On 14 August 2010 06:55, Ramneek Maan Singh  wrote:
> Hello Everyone,
>
> Can anyone point me to a publicly Question answering system built using
> lucene on TREC or non-TREC data.
>
> Regards,
> Ramneek
>






Re: Using categories with Lucene

2010-08-09 Thread Glen Newton
Hi Luan,

Could you tell us the name and/or URL of this plugin so that the list
might know about it?
Thanks,
Glen


On 10 August 2010 12:21, Luan Cestari  wrote:
>
> We would like to say thanks for the replies.
>
> We found a plugin in Nutch (the Creative Commons plugin) that does like Otis
> said. It adds information to the indexes, and then uses them to filter the
> results during the query.
>
> Thanks again for the help.
>
> Best Regards,
> Daniel & Luan
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Using-categories-with-Lucene-tp1049232p1066049.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>






Re: Databases

2010-07-22 Thread Glen Newton
LuSql is a tool specifically oriented to extracting from JDBC
accessible databases and indexing the contents.
You can find it here:
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User manual:
 http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

A new version is coming out in the next month, but the existing one
should be fine for what you have described.
If you have any questions, just let me know.

Note that if you are interested in using Solr for your application,
the data import handler (DIH) is a very flexible way of doing what you
are describing, in a Solr context.
http://wiki.apache.org/solr/DataImportHandler

Thanks,
-Glen Newton
LuSql author
http://zzzoot.blogspot.com/

On 23 July 2010 15:46, manjula wijewickrema  wrote:
> Hi,
>
> Normally, when I am building my index directory for indexed documents, I
> used to keep my indexed files simply in a directory called 'filesToIndex'.
> So in this case, I do not use any standar database management system such
> as mySql or any other.
>
> 1) Would it be possible to use MySQL or any other for the purpose of managing
> indexed documents in Lucene?
>
> 2) Is it necessary to follow such kind of methodology with Lucene?
>
> 3) If we do not use such type of database management system, will there be
> any disadvantages with large number of indexed files?
>
> Appreciate any reply from you.
> Thanks,
> Manjula.
>






Re: Best practices for searcher memory usage?

2010-07-14 Thread Glen Newton
There are a number of strategies, on the Java or OS side of things:
- Use huge pages[1], especially on 64-bit with lots of RAM. For long-running,
large-memory (and GC-busy) applications, this has achieved significant
improvements. Like 300% on EJBs. See [2],[3],[4]. For a great article
introducing and benchmarking huge pages, both in C and Java, see [5].
 To see if huge pages might help you, do
  > cat /proc/meminfo
 and check the "PageTables: 26480 kB" line.
 If PageTables is, say, more than 1-2GB, you should consider
using huge pages.
- Assuming multicore: there are times (very application dependent)
when running your application on all cores turns out not to
produce the best performance. Taking one core out, making it available to
look after system things (I/O, etc.), sometimes improves
performance. Use numactl[6] to bind your application to n-1 cores,
leaving one out.
- - numactl also allows you to restrict memory allocation to 1-n
cores, which may also be useful depending on your application
- The Java vm from Sun-Oracle has a number of options[7]
  - -XX:+AggressiveOpts [You should have this one on always...]
  - -XX:+StringCache
  - -XX:+UseFastAccessorMethods

  - -XX:+UseBiasedLocking  [My experience has this helping some
applications, hindering others...]
  - -XX:ParallelGCThreads= [Usually this is #cores; try reducing this to n/2]
  - -Xss128k
  - -Xmn [Make this large, like 40% of your heap (-Xmx). If you do
this, use -XX:+UseParallelGC. See [8]]
You can also play with the many GC parameters. This is pretty arcane,
but can give you good returns.
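
Putting a few of the flags above together with numactl (an illustrative
launch line only; the heap sizes, core range, and jar name are made up,
with -Xmn at roughly 40% of -Xmx):

numactl --physcpubind=1-7 java -XX:+AggressiveOpts -XX:+UseBiasedLocking \
    -XX:ParallelGCThreads=4 -Xss128k -Xmx10g -Xmn4g -XX:+UseParallelGC \
    -jar myapp.jar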

And of course, I/O is important: data on multiple disks with multiple
controllers; RAID; filesystem tuning; turning off atime; the readahead
buffer (change from 128k to 8MB on Linux: see [9]); OS tuning. See [9]
for a useful filesystem comparison (for Postgres).

-glen
http://zzzoot.blogspot.com/

[1]http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
[2]http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html
[3]http://kirkwylie.blogspot.com/2008/11/linux-fork-performance-redux-large.html
[4]http://orainternals.files.wordpress.com/2008/10/high_cpu_usage_hugepages.pdf
[5]http://lwn.net/Articles/374424/
[6]http://www.phpman.info/index.php/man/numactl/8
[7]http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp#PerformanceTuning
[8]http://java.sun.com/performance/reference/whitepapers/tuning.html#section4.2.5
[9]http://assets.en.oreilly.com/1/event/27/Linux%20Filesystem%20Performance%20for%20Databases%20Presentation.pdf

On 15 July 2010 04:28, Christopher Condit  wrote:
> Hi Toke-
>> > * 20 million documents [...]
>> > * 140GB total index size
>> > * Optimized into a single segment
>>
>> I take it that you do not have frequent updates? Have you tried to see if you
>> can get by with more segments without significant slowdown?
>
> Correct - in fact there are no updates and no deletions. We index everything 
> offline when necessary and just swap the new index in...
> By more segments do you mean not call optimize() at index time?
>
>> > The application will run with 10G of -Xmx but any less and it bails out.
>> > It seems happier if we feed it 12GB. The searches are starting to bog
>> > down a bit (5-10 seconds for some queries)...
>>
>> 10G sounds like a lot for that index. Two common memory-eaters are sorting
>> by field value and faceting. Could you describe what you're doing in that
>> regard?
>
> No faceting and no sorting (other than score) for this index...
>
>> Similarly, the 5-10 seconds for some queries seems very slow. Could you give
>> some examples on the queries that causes problems together with some
>> examples of fast queries and how long they take to execute?
>
> Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) AND 
> (Salsa OR Sauce) AND (This OR That)
> The latter is most typical.
>
> With a single keyword it will execute in < 1 second. In a case where there 
> are 10 clauses it becomes much slower (which I understand, just looking for 
> ways to speed it up)...
>
> Thanks,
> -Chris
>






Re: If you could have one feature in Lucene...

2010-02-27 Thread Glen Newton
Hello Uwe.

That will teach me for not keeping up with the versions! :-)
So it is up to the application to keep track of what it used for compression.
Understandable.
Thanks!

Glen

On 27 February 2010 10:17, Uwe Schindler  wrote:
> Hi Glen,
>
>
>> Pluggable compression allowing for alternatives to gzip for text
>> compression for storing.
>> Specifically I am interested in bzip2[1] as implemented in Apache
>> Commons Compress[2].
>> While bzip2 compression is considerably slower than gzip (although
>> decompression is not too much slower than gzip) it compresses much
>> better than gzip (especially text).
>>
>> Having the choice would be helpful, and for Lucene usage for non-text
>> indexing, content specific compression algorithms may outperform the
>> default gzip.
>
> Since Version 3.0 / 2.9 of Lucene compression support was removed entirely 
> (in 2.9 still avail as deprecated). All you now have to do is simply store 
> your compressed stored fields as a byte[] (see Field javadocs). By that you 
> can use any compression. The problems with gzip and the other available 
> compression algos led us to remove the compression support from Lucene (as 
> it had lots of problems). In general the way to go is: Create a 
> ByteArrayOutputStream and wrap with any compression filter, then feed your 
> data in and use "new Field(name,stream.getBytes())". On the client side just 
> use the inverse (Document.getBinaryValue(), create input stream on top of 
> byte[] and decompress).
>
> Uwe
>
>
>
>
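
A minimal sketch of what Uwe describes above, assuming Lucene 3.0's binary
stored fields (the field name and charset are illustrative). Storing the
compressed bytes alongside a second, indexed-only field with the same name
also emulates the old ANALYZED + COMPRESS combination:

import java.io.*;
import java.util.zip.*;
import org.apache.lucene.document.*;

public class CompressedFieldExample {

  // Index time: gzip the text, store the bytes, and index the plain text
  // in a second, unstored field with the same name.
  static Document makeDocument(String text) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    OutputStream gz = new GZIPOutputStream(bos);
    gz.write(text.getBytes("UTF-8"));
    gz.close();

    Document doc = new Document();
    doc.add(new Field("body", bos.toByteArray()));              // stored binary
    doc.add(new Field("body", text,
                      Field.Store.NO, Field.Index.ANALYZED));   // indexed only
    return doc;
  }

  // Search time: pull the stored bytes back out and decompress.
  static String readBody(Document hit) throws IOException {
    byte[] stored = hit.getBinaryValue("body");
    BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(stored)), "UTF-8"));
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
      sb.append(line).append('\n');  // note: normalizes line endings
    }
    return sb.toString();
  }
}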






Re: If you could have one feature in Lucene...

2010-02-27 Thread Glen Newton
Pluggable compression allowing for alternatives to gzip for text
compression for storing.
Specifically I am interested in bzip2[1] as implemented in Apache
Commons Compress[2].
While bzip2 compression is considerably slower than gzip (although
decompression is not too much slower than gzip) it compresses much
better than gzip (especially text).

Having the choice would be helpful, and for Lucene usage for non-text
indexing, content specific compression algorithms may outperform the
default gzip.

And in these days of multi-core / multi-threading, perhaps we could
convince the Apache Commons Compress team to implement a parallel Java
version of bzip2 compression (theirs is single threaded), like
pbzip2[3].

-glen


[1]http://en.wikipedia.org/wiki/Bzip2
[2]http://commons.apache.org/compress/
[3]http://compression.ca/pbzip2/

On 24 February 2010 08:42, Grant Ingersoll  wrote:
> What would it be?
>
>
>






Re: Exception while adding document in 3.0

2010-02-02 Thread Glen Newton
Documents cannot be re-used in v3.0?
 http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

-glen
http://zzzoot.blogspot.com/

On 2 February 2010 02:55, Simon Willnauer
 wrote:
> Ganesh,
>
> do you reuse your Document instances in any way or do you create new
> docs for each add?
>
> simon
>
> On Tue, Feb 2, 2010 at 7:18 AM, Ganesh  wrote:
>> I am getting the below exception while adding documents. I am adding documents 
>> continuously and at some point, I am getting the below exception. This 
>> exception is not occurring with v2.9.0
>>
>>  Exception: Index: 21, Size: 2
>>  java.util.ArrayList.RangeCheck(Unknown Source)
>>  java.util.ArrayList.get(Unknown Source)
>>  org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:175)
>>  org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:779)
>>  org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:757)
>>  org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2472)
>>  org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2446)
>>
>> Regards
>> Ganesh
>> Send instant messages to your online friends http://in.messenger.yahoo.com
>>
>>
>>
>
>
>






Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-18 Thread Glen Newton
Yes, I would agree with you on the surprise aspect. :-)

But you suggest that hiding complexity, and being in control and having
transparency, are mutually exclusive, which isn't necessarily the case.

I think I can live with the decisions made. :-)
If I can think of a viable and complete alternative, I'll run it by
the community.

thanks,
Glen

2009/11/18 Otis Gospodnetic :
> Well, I think some people will be for hiding complexity, while others will be 
> for being in control and having transparency.  Think how surprised one would 
> be to find 1 extra field in his index, say when looking at their index with 
> Luke. :)
>  Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message 
>> From: Glen Newton 
>> To: java-user@lucene.apache.org
>> Sent: Tue, November 17, 2009 10:53:01 PM
>> Subject: Re: Lucene Java 3.0.0 RC1 now available for testing
>>
>> I understand the reasons, but - if I may ask so late in the game - was
>> this the best way to do this?
>>
>> From a user (developer) perspective, this is an implementation issue.
>> Couldn't this have been done behind the scenes, so that when I asked
>> for Field.Index.ANALYZED  && Field.Store.COMPRESS, instead of what
>> previously happened (and was variously problematic), two fields were
>> transparently created, one being binary compressed stored and the
>> other being indexed only? The Field API could hide all of this
>> complexity, using one underlying Field when I use Field.getString()
>> (compressed stored one), using the other when I use Field.setBoost()
>> (the indexed one) and both when I call Field.setValue(). This might
>> have less impact on developers and be less disruptive on API changes.
>> Oh, some naming convention could handle the underlying Fields.
>>
>> A little complicated I agree.
>>
>> Again, apologies to those who worked hard on these changes: my fault
>> for not noticing this sooner (I hadn't started moving my code to 2.9
>> from 2.4 so I hadn't read the deprecation signs).
>>
>> thanks,
>>
>> Glen
>>
>>
>>
>> 2009/11/17 Mark Miller :
>> > Here is some of the history:
>> >
>> > https://issues.apache.org/jira/browse/LUCENE-652
>> > https://issues.apache.org/jira/browse/LUCENE-1960
>> >
>> > Glen Newton wrote:
>> >> Could someone send me where the rationale for the removal of
>> >> COMPRESSED fields is? I've looked at
>> >>
>> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior
>> >> but it is a little light on the 'why' of this change.
>> >>
>> >> My fault - of course - for not paying attention.
>> >>
>> >> thanks,
>> >> Glen
>> >>
>> >> 2009/11/17 Uwe Schindler :
>> >>
>> >>> Hello Lucene users,
>> >>>
>> >>>
>> >>>
>> >>> On behalf of the Lucene dev community (a growing community far larger 
>> >>> than
>> >>> just the committers) I would like to announce the first release candidate
>> >>> for Lucene Java 3.0.
>> >>>
>> >>>
>> >>>
>> >>> Please download and check it out - take it for a spin and kick the 
>> >>> tires. If
>> >>> all goes well, we hope to release the final version of Lucene 3.0 in a
>> >>> little over a week.
>> >>>
>> >>>
>> >>>
>> >>> The new version is mostly a cleanup release without any new features. All
>> >>> deprecations targeted to be removed in version 3.0 were removed. If you 
>> >>> are
>> >>> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
>> >>> warnings in your code base to be able to recompile against this version.
>> >>>
>> >>>
>> >>>
>> >>> This is the first Lucene release with Java 5 as a minimum requirement. 
>> >>> The
>> >>> API was cleaned up to make use of Java 5's generics, varargs, enums, and
>> >>> autoboxing. New users of Lucene are advised to use this version for new
>> >>> developments, because it has a clean, type safe new API. Upgrading users 
>> >>> can
>> >>> now remove unnecessary casts and add generics to their co

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
I understand the reasons, but - if I may ask so late in the game - was
this the best way to do this?

From a user (developer) perspective, this is an implementation issue.
Couldn't this have been done behind the scenes, so that when I asked
for Field.Index.ANALYZED  && Field.Store.COMPRESS, instead of what
previously happened (and was variously problematic), two fields were
transparently created, one being binary compressed stored and the
other being indexed only? The Field API could hide all of this
complexity, using one underlying Field when I use Field.getString()
(compressed stored one), using the other when I use Field.setBoost()
(the indexed one) and both when I call Field.setValue(). This might
have less impact on developers and be less disruptive on API changes.
Oh, some naming convention could handle the underlying Fields.

A little complicated I agree.

Again, apologies to those who worked hard on these changes: my fault
for not noticing this sooner (I hadn't started moving my code to 2.9
from 2.4 so I hadn't read the deprecation signs).

thanks,

Glen



2009/11/17 Mark Miller :
> Here is some of the history:
>
> https://issues.apache.org/jira/browse/LUCENE-652
> https://issues.apache.org/jira/browse/LUCENE-1960
>
> Glen Newton wrote:
>> Could someone send me where the rationale for the removal of
>> COMPRESSED fields is? I've looked at
>> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior
>> but it is a little light on the 'why' of this change.
>>
>> My fault - of course - for not paying attention.
>>
>> thanks,
>> Glen
>>
>> 2009/11/17 Uwe Schindler :
>>
>>> Hello Lucene users,
>>>
>>>
>>>
>>> On behalf of the Lucene dev community (a growing community far larger than
>>> just the committers) I would like to announce the first release candidate
>>> for Lucene Java 3.0.
>>>
>>>
>>>
>>> Please download and check it out - take it for a spin and kick the tires. If
>>> all goes well, we hope to release the final version of Lucene 3.0 in a
>>> little over a week.
>>>
>>>
>>>
>>> The new version is mostly a cleanup release without any new features. All
>>> deprecations targeted to be removed in version 3.0 were removed. If you are
>>> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
>>> warnings in your code base to be able to recompile against this version.
>>>
>>>
>>>
>>> This is the first Lucene release with Java 5 as a minimum requirement. The
>>> API was cleaned up to make use of Java 5's generics, varargs, enums, and
>>> autoboxing. New users of Lucene are advised to use this version for new
>>> developments, because it has a clean, type safe new API. Upgrading users can
>>> now remove unnecessary casts and add generics to their code, too. If you
>>> have not upgraded your installation to Java 5, please read the file
>>> JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene
>>> 3.0, it will also happen with any previous release when you upgrade your
>>> Java environment).
>>>
>>>
>>>
>>> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
>>> deprecated compressed fields; support for them was removed now. Lucene 3.0
>>> is still able to read indexes with compressed fields, but as soon as merges
>>> occur or the index is optimized, all compressed fields are decompressed and
>>> converted to Field.Store.YES. Because of this, indexes with compressed
>>> fields can suddenly get larger.
>>>
>>>
>>>
>>> While we generally try and maintain full backwards compatibility between
>>> major versions, Lucene 3.0 has some minor breaks, mostly related to
>>> deprecation removal, pointed out in the 'Changes in backwards compatibility
>>> policy' section of CHANGES.txt. Notable are:
>>>
>>>
>>>
>>> - IndexReader.open(Directory) now opens in read-only mode per default (this
>>> method was deprecated because of that in 2.9). The same occurs to
>>> IndexSearcher.
>>>
>>> - Already started in 2.9, core TokenStreams are now made final to enforce
>>> the decorator pattern.
>>>
>>> - If you interrupt an IndexWriter merge thread, IndexWriter now throws an
>>> unchecked ThreadInterruptedException that extends RuntimeException and
>>> clears the interrupt status.
>>>
>>>
>>>
>

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
Could someone send me where the rationale for the removal of
COMPRESSED fields is? I've looked at
http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior
but it is a little light on the 'why' of this change.

My fault - of course - for not paying attention.

thanks,
Glen

2009/11/17 Uwe Schindler :
> Hello Lucene users,
>
>
>
> On behalf of the Lucene dev community (a growing community far larger than
> just the committers) I would like to announce the first release candidate
> for Lucene Java 3.0.
>
>
>
> Please download and check it out - take it for a spin and kick the tires. If
> all goes well, we hope to release the final version of Lucene 3.0 in a
> little over a week.
>
>
>
> The new version is mostly a cleanup release without any new features. All
> deprecations targeted to be removed in version 3.0 were removed. If you are
> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
> warnings in your code base to be able to recompile against this version.
>
>
>
> This is the first Lucene release with Java 5 as a minimum requirement. The
> API was cleaned up to make use of Java 5's generics, varargs, enums, and
> autoboxing. New users of Lucene are advised to use this version for new
> developments, because it has a clean, type safe new API. Upgrading users can
> now remove unnecessary casts and add generics to their code, too. If you
> have not upgraded your installation to Java 5, please read the file
> JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene
> 3.0, it will also happen with any previous release when you upgrade your
> Java environment).
>
>
>
> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
> deprecated compressed fields; support for them was removed now. Lucene 3.0
> is still able to read indexes with compressed fields, but as soon as merges
> occur or the index is optimized, all compressed fields are decompressed and
> converted to Field.Store.YES. Because of this, indexes with compressed
> fields can suddenly get larger.
>
>
>
> While we generally try and maintain full backwards compatibility between
> major versions, Lucene 3.0 has some minor breaks, mostly related to
> deprecation removal, pointed out in the 'Changes in backwards compatibility
> policy' section of CHANGES.txt. Notable are:
>
>
>
> - IndexReader.open(Directory) now opens in read-only mode per default (this
> method was deprecated because of that in 2.9). The same occurs to
> IndexSearcher.
>
> - Already started in 2.9, core TokenStreams are now made final to enforce
> the decorator pattern.
>
> - If you interrupt an IndexWriter merge thread, IndexWriter now throws an
> unchecked ThreadInterruptedException that extends RuntimeException and
> clears the interrupt status.
>
>
>
> Also, remember that this is a release candidate, and not the final Lucene
> 3.0 release.
>
>
>
> You can find the full list of changes here:
>
>
>
> HTML version:
>
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html
>
>
>
> Text version:
>
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.txt
>
>
>
> Changes have also occurred in Lucene's contrib area:
>
>
>
> HTML version:
>
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Contrib-Changes.html
>
>
>
> Text version:
>
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Contrib-Changes.txt
>
>
>
> Download release candidate 1 here:
>
> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/
>
>
>
> Be sure to report back with any issues you find! Look especially for faults
> in generification of public APIs (like missing wildcards,...).
>
>
>
> Thanks,
>
> Uwe Schindler
>
>
>
> -
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
>
>
>






Re: Lucene index write performance optimization

2009-11-10 Thread Glen Newton
You might try re-implementing, using ThreadPoolExecutor
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
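
For instance, a minimal sketch of replacing the hand-rolled queue/thread
with a bounded ThreadPoolExecutor (queue capacity and pool size are
illustrative; CallerRunsPolicy throttles producers instead of dropping
documents):

import java.util.concurrent.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class PooledIndexer {
  private final IndexWriter writer;
  private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
      4, 4,                                        // fixed pool of 4 workers
      0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue<Runnable>(1000),      // bounded backlog
      new ThreadPoolExecutor.CallerRunsPolicy());  // throttle when full

  public PooledIndexer(IndexWriter writer) {
    this.writer = writer;
  }

  public void indexDocument(final Document doc) {
    pool.execute(new Runnable() {
      public void run() {
        try {
          writer.addDocument(doc);  // IndexWriter does its own locking
        } catch (Exception e) {
          e.printStackTrace();      // a real version would surface this
        }
      }
    });
  }

  public void shutdown() throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}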

glen

2009/11/10 Jamie Band :
> Hi There
>
> Our app spends a lot of time waiting for Lucene to finish writing to the
> index. I'd like to minimize this. If you have a moment to spare, please let
> me know if my LuceneIndex class presented below can be improved upon.
>
> It is used in the following way:
>
> luceneIndex = new
> LuceneIndex(Config.getConfig().getIndex().getIndexBacklog(),
>                                           exitReq,volume.getID()+"
> indexer",volume.getIndexPath(),
>
> Config.getConfig().getIndex().getMaxSimultaneousDocs());
> Document doc = new Document();
> IndexInfo indexInfo = new IndexInfo(doc);
> luceneIndex.indexDocument(indexInfo);
>
> As an aside, is there any way for Lucene to support simultaneous writes
> to an index? For example, each write thread could write to a separate
> shard, and after a period the shards could be merged into a single index? Or is
> this overkill? I am interested to hear the opinion of the Lucene experts.
>
> Thanks in advance
>
> Jamie
>
> package com.stimulus.archiva.index;
>
> import java.io.File;
> import java.io.IOException;
> import java.io.PrintStream;
> import org.apache.commons.logging.*;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.*;
> import org.apache.lucene.store.FSDirectory;
> import java.util.*;
> import org.apache.lucene.store.LockObtainFailedException;
> import org.apache.lucene.store.AlreadyClosedException;
> import java.util.concurrent.locks.ReentrantLock;
> import java.util.concurrent.*;
>
> public class LuceneIndex extends Thread {
>          protected ArrayBlockingQueue<LuceneDocument> queue;
>        protected static final Log logger =
> LogFactory.getLog(LuceneIndex.class.getName());
>        protected static final Log indexLog = LogFactory.getLog("indexlog");
>           IndexWriter writer = null;
>           protected static ScheduledExecutorService scheduler;
>        protected static ScheduledFuture scheduledTask;
>        protected LuceneDocument EXIT_REQ = null;
>        ReentrantLock indexLock = new ReentrantLock();
>        ArchivaAnalyzer analyzer     = new ArchivaAnalyzer();
>        File indexLogFile;
>        PrintStream indexLogOut;
>        IndexProcessor indexProcessor;
>        String friendlyName;
>        String indexPath;
>        int maxSimultaneousDocs;
>                   public LuceneIndex(int queueSize, LuceneDocument exitReq,
>                                String friendlyName, String indexPath, int
>  maxSimultaneousDocs) {
>               this.queue = new ArrayBlockingQueue<LuceneDocument>(queueSize);
>               this.EXIT_REQ = exitReq;
>               this.friendlyName = friendlyName;
>               this.indexPath = indexPath;
>               this.maxSimultaneousDocs = maxSimultaneousDocs;
>               setLog(friendlyName);
>           }
>                             public int getMaxSimultaneousDocs() {
>             return maxSimultaneousDocs;
>         }
>                 public void setMaxSimultaneousDocs(int maxSimultaneousDocs)
> {
>             this.maxSimultaneousDocs = maxSimultaneousDocs;
>         }
>                             public ReentrantLock getIndexLock() {
>             return indexLock;
>         }
>             protected void setLog(String logName) {
>
>               try {
>                   indexLogFile = getIndexLogFile(logName);
>                   if (indexLogFile!=null) {
>                       if (indexLogFile.length()>10485760)
>                           indexLogFile.delete();
>                       indexLogOut = new PrintStream(indexLogFile);
>                   }
>                   logger.debug("set index log file path
> {path='"+indexLogFile.getCanonicalPath()+"'}");
>               } catch (Exception e) {
>                   logger.error("failed to open index log
> file:"+e.getMessage(),e);
>               }
>         }
>               protected File getIndexLogFile(String logName) {
>              try {
>                   String logfilepath =
> Config.getFileSystem().getLogPath()+File.separator+logName+"index.log";
>                   return new File(logfilepath);
>               } catch (Exception e) {
>                   logger.error("failed to open index log
> file:"+e.getMessage(),e);
>                   return null;
>               }
>         }
>                             protected void openIndex() throws
> MessageSearchException {
>           Exception lastError = null;
>                     if (writer==null) {
>               logger.debug("openIndex() index "+friendlyName+" will be
> opened. it is currently closed.");
>           } else {
>               logger.debug("openIndex() did not bother opening index
> "+friendlyName+". it is already open.");
>               return;
>           }
>           logger.debug("opening index "+friendlyName+" for write");
>           logger.debug("opening searc

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
This is basically what LuSql does. The time improvements ("8h to 30 min")
are similar: usually on the order of an order of magnitude.

Oh, the comments suggesting most of the interaction is with the
database? The answer is: it depends.
With large Lucene documents: Lucene is the limiting factor (worsened
by going single threaded).
With small documents: it can be the DB.

Other issues include waiting for complex queries on the DB to be ready
(avoid sorting in the SQL!!).
LuSql supports out-of-band joins: don't do the join in the SQL, but do
the join from the client, with an additional (but low-cost, as it is
usually on the primary key) query for each record. Sometimes this is
better; sometimes this is worse, depending on your DB design, queries,
etc.
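
Here is a stripped-down sketch of the producer/consumer arrangement Thomas
describes below (the row type, row count, and queue size are placeholders;
the poison pill tells the consumer to stop):

import java.util.concurrent.*;

public class ProducerConsumerSketch {
  private static final String POISON = new String("POISON");  // sentinel instance
  private static final BlockingQueue<String> queue =
      new ArrayBlockingQueue<String>(10000);

  public static void main(String[] args) throws Exception {
    Thread producer = new Thread(new Runnable() {
      public void run() {
        try {
          // stand-in for iterating a JDBC ResultSet with a sane fetchSize
          for (int i = 0; i < 100000; i++) {
            queue.put("row-" + i);
          }
          queue.put(POISON);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    Thread consumer = new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            String row = queue.take();
            if (row == POISON) break;   // identity check on the sentinel
            // convert the row to a Lucene Document and addDocument(...) here
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    producer.start();
    consumer.start();
    producer.join();
    consumer.join();
  }
}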

-Glen

2009/10/22 Thomas Becker :
> Profile your application firsthand and find out where the bottlenecks really
> are during indexing.
>
> For me it was clearly the database calls which took most of the time. Due to a
> very complex SQL Query.
> I applied the Producer - Consumer pattern and put a blocking queue in 
> between. I
> have a threadpool running x producers which are sending SQL Queries to the
> database. Each returned row is put into the blockingQueue and another 
> threadpool
> running x (currently only 1) consumers is taking Objects from the row, 
> converts
> them to lucene documents and adds them to the index.
> If the last row is put into the queue I add a Poison Pill to tell the consumer
> to break.
> Using a blockingQueue limited to 10.000 entries together with jdbc fetchSize
> avoids high memory consumptions if too many producer threads return from the 
> db.
>
> This way I could reduce indexing time from around 8h to 30 min. (really). But 
> be
> careful. Load on the DB Server will surely increase.
>
> Hope that helps.
>
> Cheers,
> Thomas
>
> Paul Taylor wrote:
>> I'm building a Lucene index from a database, creating about 1 million
>> documents; unsurprisingly this takes quite a long time.
>> I do this by sending a query to the db over a range of ids (10,000
>> records),
>> add these results to Lucene,
>> then get the next 10,000, and so on.
>> When completed indexing I then call optimize()
>> I also set  indexWriter.setMaxBufferedDocs(1000) and
>> indexWriter.setMergeFactor(3000) but don't fully understand these values.
>> Each document contains about 10 small fields
>>
>> I'm looking for some ways to improve performance.
>>
>> This index writing is single threaded; is there a way I can multi-thread
>> writing to the index?
>> I only call optimize() once at the end; is that the best way to do it?
>> I'm going to run a profiler over the code, but are there any rules of
>> thumb on the best values to set for MaxBufferedDocs and MergeFactor()?
>>
>> thanks Paul
>>
>>
>
> --
> Thomas Becker
> Senior JEE Developer
>
> net mobile AG
> Zollhof 17
> 40221 Düsseldorf
> GERMANY
>
> Phone:    +49 211 97020-195
> Fax:      +49 211 97020-949
> Mobile:   +49 173 5146567 (private)
> E-Mail:   mailto:thomas.bec...@net-m.de
> Internet: http://www.net-m.de
>
> Registergericht:  Amtsgericht Düsseldorf, HRB 48022
> Vorstand:         Theodor Niehues (Vorsitzender), Frank Hartmann,
>                 Kai Markus Kulas, Dieter Plassmann
> Vorsitzender des
> Aufsichtsrates:   Dr. Michael Briem
>
>
>






Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
You might want to consider using LuSql, which is a high performance,
multithreaded, well documented tool designed specifically for moving
data from a JDBC database into Lucene (you didn't say if it was a
JDBC-accessible db...)
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclosure: I am the author of LuSql.

-Glen Newton
 http://zzzoot.blogspot.com/
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Glen_Newton


2009/10/22 Paul Taylor :
> I'm building a Lucene index from a database, creating about 1 million
> documents; unsurprisingly this takes quite a long time.
> I do this by sending a query to the db over a range of ids (10,000
> records),
> add these results to Lucene,
> then get the next 10,000, and so on.
> When completed indexing I then call optimize()
> I also set  indexWriter.setMaxBufferedDocs(1000) and
>  indexWriter.setMergeFactor(3000) but don't fully understand these values.
> Each document contains about 10 small fields
>
> I'm looking for some ways to improve performance.
>
> This index writing is single threaded; is there a way I can multi-thread
> writing to the index?
> I only call optimize() once at the end; is that the best way to do it?
> I'm going to run a profiler over the code, but are there any rules of thumb
> on the best values to set for MaxBufferedDocs and MergeFactor()?
>
> thanks Paul
>
>
>






Re: Field with reader limitation arbitrary

2009-09-15 Thread Glen Newton
I appreciate your explanation, but I think that the use case I
described merits a deeper exploration:

Scenario 1: 16 threads indexing; queue size = 1000; present api; need to store
In this scenario, there are always 1000 Strings with all the contents
of their respective files.
Averaging 50k per document = 50MB of String objects in memory.

Scenario 2: 16 threads indexing; queue size = 1000; Field constructor
with Reader and Index/Store/Tokenize; need to store
At any one time, there are only 16 Strings with their respective file
contents (i.e. read in at index time); and 984 Readers waiting in the
queue.
Averaging 50k per document = 800k of String objects in memory +
overhead of 984 Readers

Or am I not understanding something in your explanation? My
understanding is that the IndexWriter serializes Document writes, but
does not queue them (explicitly: locking multiple calls that are
waiting is not an explicit queue).

Of course, a change I could make would be to defer populating the
Field from the file Reader until just before it gets indexed, using
the String Field constructor, resulting in the equivalent of #2 above.
But pushing this to the API would be easier.
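
That deferral could look something like the following sketch (the queue
type and field names are made up): queue lightweight File references, and
only turn a file into a String, and then into a stored Field, at the moment
a worker is about to hand the Document to the IndexWriter:

import java.io.*;
import java.util.concurrent.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;

public class DeferredFieldIndexer {
  // The queue holds Files, not Strings, so resident text is bounded by
  // roughly (#worker threads x file size), not (queue size x file size).
  private final BlockingQueue<File> queue = new ArrayBlockingQueue<File>(1000);

  void worker(IndexWriter writer) throws Exception {
    while (true) {
      File f = queue.take();  // a real version would use a poison pill to stop
      String contents = readFile(f);                  // read just-in-time
      Document doc = new Document();
      doc.add(new Field("contents", contents,
                        Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
  }

  private static String readFile(File f) throws IOException {
    Reader in = new BufferedReader(new FileReader(f));
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
      sb.append(buf, 0, n);
    }
    in.close();
    return sb.toString();
  }
}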

Thanks,
Glen

2009/9/15 Chris Hostetter :
>
> : Someone has made the decision that we will not be interested in
> : storing files read using a Reader (at least not with these
> : constructors).
> : This is rather arbitrary.
>
> No, it was not arbitrary at all.
>
> The javadocs there are not a "decree" of what shall or shan't be
> supported, they are an explanation of how the constructor works so that
> there isn't any confusion.
>
> The *reason* why the code works that way, is because when Lucene
> processes Fields that use a Reader, it doesn't buffer the Reader so it
> can't store the full contents of that Reader. The *reason* it doesn't
> buffer the reader is because then it would have to make very arbitrary
> memory decisions that could easily result in OOM (Readers can in fact be
> infinite streams of characters)
>
> clients that want to store the value, need to be able to provide the
> entire value as a String.
>
> : might want to also store files in the index,  having a queue of 1000
> : Documents with 1000 Readers to files is vastly preferable to having
> : 1000 documents with 1000 (perhaps very large) Strings with all the
> : contents of the files. While this is not the best for all cases (#open
>
> You've just pointed out exactly why it's not feasible for Lucene to store
> the contents of the Reader -- it would have to do the exact same thing you
> describe.  The current API leaves it up to the client to decide when to
> do this, and what to do if the Reader produces a String bigger then it
> wants to deal with.
>
>
>
> -Hoss
>
>
>
>






Re: Field with reader limitation arbitrary

2009-09-15 Thread Glen Newton
OK, thanks. :-)

Glen

2009/9/14 Anthony Urso :
> It's best to file a feature request on the Lucene issue tracker if you
> are interested in seeing this implemented.
>
> http://issues.apache.org/jira/browse/LUCENE
>
> Just cut and paste your description and attach a patch and/or tests if
> you have them.
>
> Cheers,
> Anthony
>
> On Mon, Sep 14, 2009 at 1:03 PM, Glen Newton  wrote:
>> Hi,
>>
>> In 2.4.1, Field has 2 constructors that involve a Reader:
>> public Field(String name,
>>                  Reader reader)
>> public Field(String name,
>>                  Reader reader,
>>                  Field.TermVector termVector)
>>
>> http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader,%20org.apache.lucene.document.Field.TermVector)
>>
>> The Reader references a text file on the filesystem. These
>> constructors do the following:
>> "Create a tokenized and indexed field that is not stored, optionally
>> with storing term vectors. The Reader is read only when the Document
>> is added to the index, i.e. you may not close the Reader until
>> IndexWriter.addDocument(Document)  has been called."
>>
>> Someone has made the decision that we will not be interested in
>> storing files read using a Reader (at least not with these
>> constructors).
>> This is rather arbitrary.
>> As someone who has massively parallelized my indexing AND sometimes
>> might want to also store files in the index,  having a queue of 1000
>> Documents with 1000 Readers to files is vastly preferable to having
>> 1000 documents with 1000 (perhaps very large) Strings with all the
>> contents of the files. While this is not the best for all cases (#open
>> file handles, etc), this is a use case which would benefit from being
>> able to do this (i.e. reduced memory footprint, especially for large
>> files or large queues).
>>
>> Suggestion: replace or add a constructor with:
>> public Field(String name,
>>             Reader reader,
>>             Field.Store store,
>>             Field.Index index,
>>             Field.TermVector termVector)
>>
>> Constructively,
>>
>> Glen Newton
>>  http://zzzoot.blogspot.com/
>>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>>
>>
>>
>
>
>






Field with reader limitation arbitrary

2009-09-14 Thread Glen Newton
Hi,

In 2.4.1, Field has 2 constructors that involve a Reader:
public Field(String name,
  Reader reader)
public Field(String name,
  Reader reader,
  Field.TermVector termVector)

http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader,%20org.apache.lucene.document.Field.TermVector)

The Reader references a text file on the filesystem. These
constructors do the following:
"Create a tokenized and indexed field that is not stored, optionally
with storing term vectors. The Reader is read only when the Document
is added to the index, i.e. you may not close the Reader until
IndexWriter.addDocument(Document)  has been called."

Someone has made the decision that we will not be interested in
storing files read using a Reader (at least not with these
constructors).
This is rather arbitrary.
As someone who has massively parallelized my indexing AND sometimes
might want to also store files in the index,  having a queue of 1000
Documents with 1000 Readers to files is vastly preferable to having
1000 documents with 1000 (perhaps very large) Strings with all the
contents of the files. While this is not the best for all cases (#open
file handles, etc), this is a use case which would benefit from being
able to do this (i.e. reduced memory footprint, especially for large
files or large queues).

Suggestion: replace or add a constructor with:
public Field(String name,
 Reader reader,
 Field.Store store,
 Field.Index index,
 Field.TermVector termVector)

Constructively,

Glen Newton
 http://zzzoot.blogspot.com/
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql




Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
Paul,

I saw your last post and now understand the issues you face.

I don't think there has been any effort to produce a
reduced-memory-footprint configurable (RMFC) Lucene. With the many
mobile, embedded, and other reduced-memory devices, should this
perhaps be one of the areas the Lucene community looks into?

-Glen

2009/9/11  :
> Thanks Glen!
>
> I will take a look at your project.  Unfortunately I will only have 512 MB to 1024 
> MB to work with, as Lucene is only one component in a larger software system 
> running on one machine.  I agree with you on the C/C++ comment.  That is what 
> I would normally use for memory-intense software.  It turns out that the 
> larger the file you want to index, the larger the heap space you will need.  
> What I would like to see is a way to "throttle" the indexing process to 
> control the memory footprint.  I understand that this will take longer, but 
> if I perform the task during off hours it shouldn't matter. At least the file 
> will be indexed correctly.
>
> Thanks,
> Paul
>
>
> -Original Message-
> From: java-user-return-42272-paul_murdoch=emainc@lucene.apache.org 
> [mailto:java-user-return-42272-paul_murdoch=emainc@lucene.apache.org] On 
> Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 9:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> In this project:
>  http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>
> I concatenate all the text of all of articles of a single journal into
> a single text file.
> This can create a text file that is 500MB in size.
> Lucene is OK in indexing files this size (in parallel even), but I
> have a heap size of 8GB.
>
> I would suggest increasing your heap to as large as your machine can
> reasonably take.
> The reality is that Java programs (like Lucene) take up more memory
> than a similar C or even C++ program.
> Java may approach C/C++ in speed, but not memory.
>
We don't use Java for its memory footprint!  ;-)
>
> See:
>  Programming language shootout: speed:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>  Programming language shootout: memory:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>
> -glen
>
> 2009/9/11 Dan OConnor :
>> Paul:
>>
>> My first suggestion would be to update your JVM to the latest version (or at 
>> least .14). There were several garbage collection related issues resolved in 
>> version 10 - 13 (especially dealing with large heaps).
>>
>> Next, your IndexWriter parameters would help figure out why you are using so 
>> much RAM
>>getMaxFieldLength()
>>getMaxBufferedDocs()
>>getMaxMergeDocs()
>>getRAMBufferSizeMB()
>>
>> How often are you calling commit?
>> Do you close your IndexWriter after every document?
>> How many documents of this size are you indexing?
>> Have you used luke to look at your index?
>> If this is a large index, have you optimized it recently?
>> Are there any searches going on while you are indexing?
>>
>>
>> Regards,
>> Dan
>>
>>
>> -Original Message-
>> From: paul_murd...@emainc.com [mailto:paul_murd...@emainc.com]
>> Sent: Friday, September 11, 2009 7:57 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Indexing large files? - No answers yet...
>>
>> This issue is still open.  Any suggestions/help with this would be
>> greatly appreciated.
>>
>> Thanks,
>>
>> Paul
>>
>>
>> -Original Message-
>> From: java-user-return-42080-paul_murdoch=emainc@lucene.apache.org
>> [mailto:java-user-return-42080-paul_murdoch=emainc@lucene.apache.org
>> ] On Behalf Of paul_murd...@emainc.com
>> Sent: Monday, August 31, 2009 10:28 AM
>> To: java-user@lucene.apache.org
>> Subject: Indexing large files?
>>
>> Hi,
>>
>>
>>
>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>> consistently receiving "OutOfMemoryError: Java heap spac

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
In this project:
 http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

I concatenate all the text of all the articles of a single journal into
a single text file.
This can create a text file that is 500MB in size.
Lucene is OK in indexing files this size (in parallel even), but I
have a heap size of 8GB.

I would suggest increasing your heap to as large as your machine can
reasonably take.
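In practice that is just -Xmx on the java command line; the class name and
sizes here are illustrative only:

  java -server -Xms2g -Xmx8g MyIndexer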
The reality is that Java programs (like Lucene) take up more memory
than a similar C or even C++ program.
Java may approach C/C++ in speed, but not memory.

We don't use Java for its memory footprint!  ;-)

See:
 Programming language shootout: speed:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1®exdna=1&revcomp=1&spectralnorm=1&threadring=0
 Programming language shootout: memory:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1®exdna=1&revcomp=1&spectralnorm=1&threadring=0

-glen

2009/9/11 Dan OConnor :
> Paul:
>
> My first suggestion would be to update your JVM to the latest version (or at 
> least update 14). There were several garbage-collection-related issues resolved 
> in updates 10-13 (especially those dealing with large heaps).
>
> Next, your IndexWriter parameters would help figure out why you are using so 
> much RAM
>getMaxFieldLength()
>getMaxBufferedDocs()
>getMaxMergeDocs()
>getRAMBufferSizeMB()
>
> How often are you calling commit?
> Do you close your IndexWriter after every document?
> How many documents of this size are you indexing?
> Have you used Luke to look at your index?
> If this is a large index, have you optimized it recently?
> Are there any searches going on while you are indexing?
>
>
> Regards,
> Dan
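For reference, the parameters in Dan's list above are tuned through the
corresponding setters on IndexWriter. A sketch; the path and values are
illustrative only, not recommendations:

  IndexWriter writer = new IndexWriter("/path/to/index",
      new StandardAnalyzer(), false);
  writer.setMaxFieldLength(50000);           // max tokens indexed per field
  writer.setMaxBufferedDocs(100);            // flush after this many buffered docs
  writer.setMaxMergeDocs(Integer.MAX_VALUE); // no cap on merged segment size
  writer.setRAMBufferSizeMB(32.0);           // or flush by RAM used, whichever comes first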
>
>
> -Original Message-
> From: paul_murd...@emainc.com [mailto:paul_murd...@emainc.com]
> Sent: Friday, September 11, 2009 7:57 AM
> To: java-user@lucene.apache.org
> Subject: RE: Indexing large files? - No answers yet...
>
> This issue is still open.  Any suggestions/help with this would be
> greatly appreciated.
>
> Thanks,
>
> Paul
>
>
> -Original Message-
> From: java-user-return-42080-paul_murdoch=emainc@lucene.apache.org
> [mailto:java-user-return-42080-paul_murdoch=emainc@lucene.apache.org
> ] On Behalf Of paul_murd...@emainc.com
> Sent: Monday, August 31, 2009 10:28 AM
> To: java-user@lucene.apache.org
> Subject: Indexing large files?
>
> Hi,
>
>
>
> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
> consistently receiving "OutOfMemoryError: Java heap space", when trying
> to index large text files.
>
>
>
> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
> max. heap size.  So I increased the max. heap size to 512 MB.  This
> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
> to do this.  Why so much?
>
>
>
> The class FreqProxTermsWriterPerField appears to be the biggest memory
> consumer by far according to JConsole and the TPTP Memory Profiling
> plugin for Eclipse Ganymede.
>
>
>
> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
> max. heap size.  Increasing the max. heap size to 1024 MB works but
> Lucene uses 826 MB of heap space while performing this.  Still seems
> like way too much memory is being used to do this.  I'm sure larger
> files would cause the error, as heap use seems to scale with file size.
>
>
>
> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
> practice for indexing large files?  Here is a code snippet that I'm
> using:
>
>
>
> // Index the content of a text file.
> private Boolean saveTXTFile(File textFile, Document textDocument)
>     throws CIDBException {
>
>   try {
>     Boolean isFile = textFile.isFile();
>     Boolean hasTextExtension = textFile.getName().endsWith(".txt");
>
>     if (isFile && hasTextExtension) {
>       System.out.println("File " + textFile.getCanonicalPath()
>           + " is being indexed");
>       Reader textFileReader = new FileReader(textFile);
>       if (textDocument == null)
>         textDocument = new Document();
>       textDocument.add(new Field("content", textFileReader));
>       indexWriter.addDocument(textDocument); // BREAKS HERE
>     }
>
>   } catch (FileNotFoundException fnfe) {
>     System.out.println(fnfe.getMessage());
>     return false;
>   } catch (CorruptIndexException cie) {
>     throw new CIDBException("The index has become cor
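One workaround that caps the memory needed for a very large file, sketched
here on the assumption that your search semantics can tolerate one document
per chunk (field names and chunk size are illustrative):

  BufferedReader in = new BufferedReader(new FileReader(textFile));
  char[] buf = new char[1 << 20];  // ~1M chars per chunk; boundaries may split words
  int len;
  int chunk = 0;
  while ((len = in.read(buf)) != -1) {
    Document d = new Document();
    d.add(new Field("file", textFile.getName(),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("chunk", Integer.toString(chunk++),
        Field.Store.YES, Field.Index.NO));
    d.add(new Field("content", new String(buf, 0, len),
        Field.Store.NO, Field.Index.TOKENIZED));
    indexWriter.addDocument(d);    // each chunk is buffered and flushed like a small doc
  }
  in.close();

Each chunk then stays within the writer's RAM buffer limits, at the cost of
losing matches that span a chunk boundary.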

Re: [EASY]How to change the demo of lucene143 into a multithread one?

2009-08-13 Thread Glen Newton
You are optimizing before the threads are finished adding to the index.
I think this should work:

IndexWriter writer = new IndexWriter("D:\\index", new StandardAnalyzer(),
true);
File file=new File(args[0]);
Thread t1=new Thread(new IndexFiles(writer,file));
Thread t2=new Thread(new IndexFiles(writer,file));
Thread t3=new Thread(new IndexFiles(writer,file));
t1.start();
t2.start();
t3.start();
while (t1.getState() != State.TERMINATED
    || t2.getState() != State.TERMINATED
    || t3.getState() != State.TERMINATED) {
  try {
    Thread.sleep(100L);
  } catch (InterruptedException ie) {
    ie.printStackTrace();
  }
} // wait until the threads end

writer.optimize();
writer.close();
Date end = new Date();
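For what it's worth, Thread.join() blocks until a thread terminates, so the
polling loop is not needed. A minimal equivalent sketch:

  t1.start();
  t2.start();
  t3.start();
  try {
    t1.join();   // blocks until t1 terminates
    t2.join();
    t3.join();
  } catch (InterruptedException ie) {
    ie.printStackTrace();
  }
  writer.optimize();
  writer.close();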

2009/8/13 Chuan SHI :
> Hi all,
>       I am new to multi-threaded programming and Lucene. I want to change the
> indexing demo of lucene143 into a multi-threaded one. I create one instance of
> IndexWriter which is shared by three threads. But I find that the time it
> takes with three threads is approximately three times that of a
> single thread. (My computer is dual-core.) It seems I have written a pseudo
> multi-threaded program that does the same work three times.
> Following is a snippet of my code. Please tell me how to write the correct
> code. Thanks.
>
> IndexWriter writer = new IndexWriter("D:\\index", new StandardAnalyzer(),
> true);
> File file=new File(args[0]);
> Thread t1=new Thread(new IndexFiles(writer,file));
> Thread t2=new Thread(new IndexFiles(writer,file));
> Thread t3=new Thread(new IndexFiles(writer,file));
> t1.start();
> t2.start();
> t3.start();
> writer.optimize();
> writer.close();
>
> while(t1.getState()!=State.TERMINATED
> ||t2.getState()!=State.TERMINATED
> ||t3.getState()!=State.TERMINATED
> ){}//wait until the threads end.
> Date end = new Date();
>
> --
> Best regards,
>
> Chuan SHI
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Visualizing Semantic Journal Space (large scale) using full-text

2009-07-29 Thread Glen Newton
I thought the Lucene and Solr communities would find this interesting:
My collaborators and I have used LuSql, Lucene and Semantic Vectors to
visualize a large-scale semantic journal space (kind of like 'Maps of
Science') for a large (5.7 million articles) journal article collection,
using only the full text (no metadata).

For more info & howto:
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

Glen Newton

-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New tool: LSql

2009-04-14 Thread Glen Newton
LuSql 0.9 comes with Lucene 2.3.1 bundled in the jar (along with
commons-cli-1.1, commons-dbcp-1.2.2, commons-pool-1.4,
mysql-connector-java-5.0.7).

It can also run with Lucene 2.4: put all of the above jars on your
classpath along with the 2.4 jar, and launch LuSql not with "-jar" but
via its fully qualified main class: ca.nrc.cisti.lusql.core.LuSqlMain
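Roughly like this (the jar file names are illustrative; the non-Lucene jars
are the ones bundled with 0.9):

  java -cp lusql.jar:lucene-core-2.4.0.jar:commons-cli-1.1.jar:\
  commons-dbcp-1.2.2.jar:commons-pool-1.4.jar:mysql-connector-java-5.0.7.jar \
    ca.nrc.cisti.lusql.core.LuSqlMain [options]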

Let me know if you have any problems.

The next version of LuSql will include the latest stable Lucene
release, in a month or so.

thanks,

Glen

2009/4/14 Greg Shackles :
> This could be very useful.  I see you include Lucene v2.3 in your
> code...does it work correctly with indexes created on v2.4 as well?
> - Greg
>
>
> On Mon, Apr 13, 2009 at 6:49 PM, Glen Newton  wrote:
>
>> As the creator of LuSql
>> [http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql]
>> I would have hoped for a more creative (and more different) name.
>> :-)
>>
>> -glen
>>
>> 2009/4/13 jonathan esposito :
>> > I created a command-line tool in Java that allows the user to execute
>> > sql-like commands against a lucene index.  This is useful for
>> > automating Lucene index migrations without writing any code.
>> > Essentially, you can treat a Lucene index the same as you would a
>> > database.  For example, you can write:
>> >
>> > UPDATE field1=newvalue WHERE +field1:oldvalue
>> >
>> > Or you can simply view data by using the SELECT command:
>> >
>> > SELECT field1,field2 WHERE +field1:value
>> >
>> > You are welcome to visit the project here:
>> > http://code.google.com/p/lucene-sql/.  Any contributions or
>> > suggestions would be greatly appreciated.  I will make an effort to
>> > provide more documentation shortly.
>> >
>> > Thanks,
>> > Jonathan Esposito
>> >
>> > -
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >
>> >
>>
>>
>>
>> --
>>
>> -
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New tool: LSql

2009-04-13 Thread Glen Newton
As the creator of LuSql
[http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql]
I would have hoped for a more creative (and more different) name.
:-)

-glen

2009/4/13 jonathan esposito :
> I created a command-line tool in Java that allows the user to execute
> sql-like commands against a lucene index.  This is useful for
> automating Lucene index migrations without writing any code.
> Essentially, you can treat a Lucene index the same as you would a
> database.  For example, you can write:
>
> UPDATE field1=newvalue WHERE +field1:oldvalue
>
> Or you can simply view data by using the SELECT command:
>
> SELECT field1,field2 WHERE +field1:value
>
> You are welcome to visit the project here:
> http://code.google.com/p/lucene-sql/.  Any contributions or
> suggestions would be greatly appreciated.  I will make an effort to
> provide more documentation shortly.
>
> Thanks,
> Jonathan Esposito
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can I run Lucene in google app engine?

2009-04-13 Thread Glen Newton
Another solution is to have your application on the AppEngine, but the
index is on another machine. Then the application 'proxies' the
requests to the machine that has the index, which is using Solr
[http://lucene.apache.org/solr/] or some other way to expose the
index to the web.

Yes, this means you still need hosting for the index + Solr.

-glen
http://zzzoot.blogspot.com/

2009/4/13 Chris Lu :
> Surely it's possible, but it has too many limitations to allow scalable
> Lucene usage. However, it depends on your requirements.
>
> 1) You cannot write an index to disk, but you can read files. So, theoretically,
> if the index is read-only and small, you can package it with the war file.
>
> 2) If you need to update the index, you will have to store the index with
> Google's data store, just like store an index into databases. Sure it'll
> work. But performance would suffer because the whole index must be transferred
> into memory before searching can really start. On the other hand, this could be a
> good solution for small index with per-user data.
>
> 3) For large changing indexes, you need to find other solutions to maintain
> lucene index.
>
> My personal opinion is that finding $20/month VPS hosting is far easier than
> changing the way you code.
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
>
> Noble Paul ??? ?? wrote:
>>
>> Is it possible to run Lucene in Google App Engine? Has anyone tried it?
>>
>>
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: LuSQL download link error?

2009-04-02 Thread Glen Newton
Dear Shashi,

It should work now.
A temporary failure: our apologies.

thanks,

Glen

2009/4/2 Shashi Kant :
> Hi all, I have been trying to get the latest version of LuSQL from the
> NRC.ca website but get 404s on the download links. I have written to the
> webmaster, but anyone have the jar handy? Could I download from somewhere
> else? or could you email it to me?
>
> thanks,
> Shashi
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Glen Newton
You might try asking on a list that covers recommender systems.
Some Google hits:
- http://en.wikipedia.org/wiki/Recommendation_system
- ACM Recommender Systems 2009 http://recsys.acm.org/
- A Guide to Recommender Systems
http://www.readwriteweb.com/archives/recommender_systems.php

2009/3/17 Aaron Schon :
>
> Hi all, Apologies if this question is off-topic, but I was wondering if there 
> is a way of leveraging Lucene (or other mechanism) to store the information 
> about connections and recommend People you might know as done in FB or LI.
>
> The data is as follows:
>
> john_sm...@somedomain.com, jane_...@otherdomain.com
>
>
> john_sm...@somedomain.com, frank_jo...@someotherplace.com
>
> and so on...
>
> how would I go about recommending that Jane Doe connect to Frank Jones? Hope 
> you can help a newbie by pointing out where I should be looking.
>
> Thanks in advance,
> AS
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: public apology for company spam

2009-03-05 Thread Glen Newton
Yonik,

Thank-you for your email. I appreciate and accept your apology.

Indeed the spam was annoying, but I think that you and your colleagues
have significant social capital in the Lucene and Solr communities, so
this minor but unfortunate incident should have minimal impact.

That said, you and your colleagues do not have infinite social
capital, and hopefully you will have no reason to be forced to spend
this capital in such an unfortunate manner in the future.  :-)

sincerely,

Glen Newton

2009/3/5 Yonik Seeley :
> This morning, an apparently over-zealous marketing firm, on behalf of
> the company I work for, sent out a marketing email to a large number
> of subscribers of the Lucene email lists.  This was done without my
> knowledge or approval, and I can assure you that I'll make all efforts
> to prevent it from happening again.
>
> Sincerest apologies,
> -Yonik
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Merging database index with fulltext index

2009-03-01 Thread Glen Newton
I would suggest you try LuSql, which was designed specifically to
index relational databases into Lucene.

It has an extensive user manual/tutorial which has some complex
examples involving multi-joins and sub-queries.

I am the author of LuSql.
LuSql home page:
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
LuSql manual: 
http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

thanks,

Glen

2009/2/28  :
> Hi,
>
> what is the best approach to merging a database index with a Lucene full-text
> index? Both databases store a unique ID per doc. This is the join criterion.
>
> requirements:
>
> * both resultsets may be very big (100,000 docs and more)
> * the merged resultset must be sorted by database index and/or relevance
> * optional paging the merged resultset, a page has a size of 1000 docs max.
>
> example:
>
> select a, b from dbtable where c = 'foo' and content='bar' order by
> relevance, a desc, d
>
> I would split this into:
>
> database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
> lucene: content:bar (sort:relevance)
> merge: loop over the lucene resultset and add the db record into a new list
> if the ID matches.
>
> If the resultset must be paged:
>
> database: select ID from dbtable where c = 'foo' order by a desc, d
> lucene: content:bar (sort:relevance)
> merge: loop over the lucene resultset and add the db record into a new list
> if the ID matches.
> page 1: select a,b from dbtable where ID IN (list of the ID's of page 1)
> page 2: select a,b from dbtable where ID IN (list of the ID's of page 2)
> ...
>
>
> Is there a better way?
>
> Thank you.
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
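A minimal sketch of the merge loop described above, assuming the DB rows fit
in memory keyed by ID, and that each Lucene document stores its ID in a field
named "ID" (all names illustrative; imports from java.util, java.io and the
Lucene search/document packages assumed):

  // walk the Lucene hits in relevance order, keeping only IDs the DB query matched
  static List<String[]> merge(IndexSearcher searcher, Query query,
                              Map<String, String[]> dbRowsById) throws IOException {
    List<String[]> merged = new ArrayList<String[]>();
    Hits hits = searcher.search(query);   // relevance order
    for (int i = 0; i < hits.length(); i++) {
      String id = hits.doc(i).get("ID");  // stored ID field
      String[] row = dbRowsById.get(id);  // null if the DB filter excluded it
      if (row != null)
        merged.add(row);
    }
    return merged;
  }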



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-19 Thread Glen Newton
 1165 MHz UltraSPARC-T2
>  P32: 1165 MHz UltraSPARC-T2
>  P33: 1165 MHz UltraSPARC-T2
>  P34: 1165 MHz UltraSPARC-T2
>  P35: 1165 MHz UltraSPARC-T2
>  P36: 1165 MHz UltraSPARC-T2
>  P37: 1165 MHz UltraSPARC-T2
>  P38: 1165 MHz UltraSPARC-T2
>  P39: 1165 MHz UltraSPARC-T2
>  P40: 1165 MHz UltraSPARC-T2
>  P41: 1165 MHz UltraSPARC-T2
>  P42: 1165 MHz UltraSPARC-T2
>  P43: 1165 MHz UltraSPARC-T2
>  P44: 1165 MHz UltraSPARC-T2
>  P45: 1165 MHz UltraSPARC-T2
>  P46: 1165 MHz UltraSPARC-T2
>  P47: 1165 MHz UltraSPARC-T2
>  P48: 1165 MHz UltraSPARC-T2
>  P49: 1165 MHz UltraSPARC-T2
>  P50: 1165 MHz UltraSPARC-T2
>  P51: 1165 MHz UltraSPARC-T2
>  P52: 1165 MHz UltraSPARC-T2
>  P53: 1165 MHz UltraSPARC-T2
>  P54: 1165 MHz UltraSPARC-T2
>  P55: 1165 MHz UltraSPARC-T2
>  P56: 1165 MHz UltraSPARC-T2
>  P57: 1165 MHz UltraSPARC-T2
>  P58: 1165 MHz UltraSPARC-T2
>  P59: 1165 MHz UltraSPARC-T2
>  P60: 1165 MHz UltraSPARC-T2
>  P61: 1165 MHz UltraSPARC-T2
>  P62: 1165 MHz UltraSPARC-T2
>  P63: 1165 MHz UltraSPARC-T2
> OS release detail:
> Solaris 10 5/08 s10s_u5wos_10 SPARC Copyright 2008 Sun Microsystems, Inc.
>  All Rights Reserved. Use is subject to license terms. Assembled 24 March
> 2008
>
>Workload Measurements
>
> Observed system for 10 min
>   in intervals of10 sec
> Cycles  44768051692942
> Instructions3980806371547
> CPI 11.25 **
> FP instructions 5821938521
> Emulated FP instructions0
> FP Percentage0.1%
> The following applies to the measurement interval with the
> busiest single thread or process:
> Peak thread utilization at  2009-02-13 14:36:16
>  Corresponding file name  1234515976
>  CPU utilization   39.4%
>  Command  java
>  PID/LWPID16396/1
>  Thread utilization   0.6%
> More detail on processes and threads is in data/process.out
>
> **Cycles per Instruction (CPI) is not comparable between UltraSPARC
> T1 and T2 processors and conventional processors. Conventional
> processors execute an idle loop when there is no work to do, so
> CPI may be artificially low, especially when the system is
> somewhat idle. The UltraSPARC T1 and T2 "park" idle threads,
> consuming no energy, when there is no work to do, so CPI may
> be artificially high, especially when the system is somewhat idle.
>
>Advice
>
> Floating Point GREEN
>  Observed floating point content was not excessive for
>  an UltraSPARC T1 processor. Floating point content is not
>  a limitation for UltraSPARC T2.
>
> ParallelismGREEN
>  The observed workload has sufficient threads of execution to
>  efficiently utilize the multiple cores and threads of an
>  UltraSPARC T1 or UltraSPARC T2 processor.
>
>
>
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
>
>
> Glen Newton wrote:
>>
>> Could you give some configuration details:
>> - Solaris version
>> - Java VM version, heap size, and any other flags
>> - disk setup
>>
>> You should also consider using huge pages (see
>>
>> http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html)
>>
>> I will also be posting performance gains using huge pages for Java
>> Lucene Linux large scale indexing in the next week or so...
>>
>> -glen
>>
>>
>>
>> 2009/2/18 Varun Dhussa :
>>
>>>
>>> Hi,
>>>
>>> I have had a bad experience when migrating my application from Intel Xeon
>>> based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search
>>> just
>>> does not perform. A search which took approximately 500 ms takes more
>>> than 6
>>> seconds to execute.
>>>
>>> The index has about 100,000,000 records. So, I tried to split it into 10
>>> indices and used the ParallelSearcher on it, but still got similar
>>> results.
>>>
>>> I am guessing that this is because the distance implementation used by
>>> Lucene requires higher clock speed and can't be parallelized much.
>>>
>>> Please advise
>>>
>>> --
>>> Varun Dhussa
>>> Product Architect
>>> CE InfoSystems (P) Ltd
>>> http://www.mapmyindia.com
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>>
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Glen Newton
Could you give some configuration details:
- Solaris version
- Java VM version, heap size, and any other flags
- disk setup

You should also consider using huge pages (see
http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html)

I will also be posting performance gains using huge pages for Java
Lucene Linux large scale indexing in the next week or so...

-glen



2009/2/18 Varun Dhussa :
> Hi,
>
> I have had a bad experience when migrating my application from Intel Xeon
> based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just
> does not perform. A search which took approximately 500 ms takes more than 6
> seconds to execute.
>
> The index has about 100,000,000 records. So, I tried to split it into 10
> indices and used the ParallelSearcher on it, but still got similar results.
>
> I am guessing that this is because the distance implementation used by
> Lucene requires higher clock speed and can't be parallelized much.
>
> Please advise
>
> --
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Visualization

2009-02-12 Thread Glen Newton
V1 of a project of mine, Ungava[1], which uses Lucene to index
research articles and library catalog metadata, also uses Project
Simile's Metaphor and Timeline. I have some simple examples using
them:

Here is the search for "cell" in articles:
 
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?collection=jos&contents=cell

Here is a Timeline view of the search "cell" for articles:
 
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?calyHandler=timeLineView&collection=jos&contents=cell

here is the Exhibit view:
 
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?calyHandler=exhibit&collection=jos&contents=cell

Here is the keyword drill cloud[2] view:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?tagCloud=true&collection=jos&tagField=keyword&contents=cell&numCloudDocs=200&numCloudTags=50


Here is the "cell" search of the library catalog:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava01/Search?collection=csu&title=cell&sauthor=&keyword=&syear=-1&eyear=-1&sortBy=relevance

Timeline view:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava01/Search?calyHandler=timeLineView&collection=csu&title=cell

Subject Drill Cloud:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava01/Search?tagCloud=true&collection=csu&tagField=keyword&title=cell&numCloudDocs=200&numCloudTags=50&sortBy=relevance

-Glen

[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Ungava
[2]http://zzzoot.blogspot.com/2007/10/drill-clouds-for-search-refinement-id.html



2009/2/12 Omar Alonso :
> Hi,
>
> Depends on the kind of work that you want to do. For trying out ideas I think it 
> is pretty cool. I've used it for visualizing the DBLP data set and it was OK. I 
> also played with an early version of LabEscape for TreeMaps. There is a paper 
> on the project in case  you want to take a look: 
> www.oracle.com/technology/tech/semantic_technologies/pdf/informationgrid_oracle.pdf.
> BTW, I'm no longer at Oracle but I'm happy to answer questions that you may 
> have.
>
> Another toolkit that I like is SIMILE (http://simile.mit.edu/).
>
> Regards,
>
> o.
>
> --- On Thu, 2/12/09, Shashi Kant  wrote:
>
> From: Shashi Kant 
> Subject: Re: Visualization
> To: java-user@lucene.apache.org
> Date: Thursday, February 12, 2009, 3:05 AM
>
> Thanks Omar, I have looked at Prefuse.
> What has been your experience with it, given it is still in beta? Any 
> "gotchas" we should look out for?
>
> regards,
> shashi
>
>
>
>
>
> - Original Message 
> From: Omar Alonso 
> To: java-user@lucene.apache.org; Shashi Kant 
> Sent: Thursday, February 12, 2009 4:38:29 AM
> Subject: Re: Visualization
>
> prefuse.org
>
>
> --- On Thu, 2/12/09, Shashi Kant  wrote:
>
> From: Shashi Kant 
> Subject: Visualization
> To: java-user@lucene.apache.org
> Date: Thursday, February 12, 2009, 12:53 AM
>
> Hi all,
>
> Apologies for being slightly off-topic, we are looking at novel visualization 
> approaches for rendering results from Lucene queries. I was wondering if you 
> have any recommendations for visualization toolkits (Java) for displaying 
> heat-maps, dendrograms, cluster maps etc. (preferably free/OSS)
>
> Thanks in advance!
> Shashi
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [ANN] Lucid Imagination

2009-01-26 Thread Glen Newton
Congrats & good-luck on this new endeavour!

-Glen  :-)

2009/1/26 Grant Ingersoll :
> Hi Lucene and Solr users,
>
> As some of you may know, Yonik, Erik, Sami, Mark and I teamed up with
> Marc Krellenstein to create a company to provide commercial
> support (with SLAs), training, value-add components and services to
> users of Lucene and Solr.  We have been relatively quiet up until now as we
> prepare our
> offerings, but I am now pleased to announce the official launch of
> Lucid Imagination.  You can find us at http://www.lucidimagination.com/
> and learn more about us at http://www.lucidimagination.com/About/.
>
> We have also launched a beta search site dedicated to searching all
> things in the Lucene ecosystem: Lucene, Solr, Tika, Mahout, Nutch,
> Droids, etc.  It's powered, of course, by Lucene via Solr (we'll
> provide details in a separate message later about our setup.)  You can
> search the Lucene family of websites, wikis, mail archives and JIRA issues
> all in one place.
> To try it out, browse to http://www.lucidimagination.com/search/.
>
> Any and all feedback is welcome at f...@lucidimagination.com.
>
> Thanks,
> Grant
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
>
>
>
>
>
>
>
>
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: clustering with compass & terracotta

2009-01-15 Thread Glen Newton
There is a discussion here:
 http://www.terracotta.org/web/display/orgsite/Lucene+Integration

Also of interest: "Katta - distribute lucene indexes in a grid"
http://katta.wiki.sourceforge.net/

-glen

http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html
http://zzzoot.blogspot.com/2008/11/software-announcement-lusql-database-to.html
http://zzzoot.blogspot.com/2008/09/katta-released-lucene-on-grid.html
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html


2009/1/15 Angel, Eric :
> I just ran into this
> http://www.compass-project.org/docs/2.0.0/reference/html/needle-terracot
> ta.html and was wondering if any of you had tried anything like this and
> if so, what your experience was like.
>
>
>
> Eric
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help with installing Lucene

2009-01-07 Thread Glen Newton
> I'm not sure if it's a better idea to use something like Solr or start from
> scratch and customize the application as I move forward. What do you think

LuSql might be appropriate for your needs:
"LuSql is a high-performance, simple tool for indexing data held in a
DBMS into a Lucene index. It can use any JDBC-aware SQL database."
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclaimer: I am the author of LuSql.

-Glen


2009/1/7 ahammad :
>
>
>
> Greg Shackles wrote:
>>
>>
>> Depending on what you need, there might be something already built that
>> can
>> do what you want.  I can't look up links right now but you might want to
>> look into Solr and see if that works for what you want.  Otherwise, I
>> think
>> there are code samples and whatnot on the Lucene site to help get you
>> started writing your own application.  It's very easy to use : )
>>
>> - Greg
>>
>>
>
>
> Essentially, we have a database (can't recall if it is Oracle or MSSQL) that
> contains a bunch of articles. There is a website with search functionality
> that allows the user to retrieve those articles and display them on the
> page. Essentially it's like a Wikipedia type website. If it's relevant, I'll
> see if I can get the existing architecture that we currently use.
>
> I'm not sure if it's a better idea to use something like Solr or start from
> scratch and customize the application as I move forward. What do you think?
>
> Thanks for all the replies btw.
> --
> View this message in context: 
> http://www.nabble.com/Help-with-installing-Lucene-tp21332541p21336546.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: FastSSFuzzy for faster fuzzy queries in Lucene

2009-01-06 Thread Glen Newton
- Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/
- Paper: Fast Similarity Search in Large Dictionaries.
http://fastss.csg.uzh.ch/ifi-2007.02.pdf
- FastSimilarSearch.java http://fastss.csg.uzh.ch/FastSimilarSearch.java
- Paper: Fast Similarity Search in Peer-to-Peer Networks.
  http://www.globis.ethz.ch/script/publication/download?docid=506

-Glen
http://zzzoot.blogspot.com


2009/1/5 Grant Ingersoll :
> Do you have a reference paper/link on it?  Sounds interesting.
>
> On Jan 5, 2009, at 8:17 PM, Jason Rutherglen wrote:
>
>> Hello,
>>
>> I'm interested in getting FastSSFuzzy into Lucene, perhaps as a contrib
>> module.  One question is how much would the index grow?  We've got a list
>> of
>> people's names we want to do spellchecking on for example.
>>
>> -J
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
Oops. Thanks!  :-)

2008/12/10 Gary Moore <[EMAIL PROTECTED]>:
>  svn co https://bobo-browse.svn.sourceforge.net/svnroot/bobo-browse/trunk
> bobo-browse
> -Gary
> Glen Newton wrote:
>>
>> I don't think this is an Open Source project: I couldn't find any
>> source on the site and the only download is a jar with .class files...
>>
>> -glen
>>
>> 2008/12/10 John Wang <[EMAIL PROTECTED]>:
>>
>>>
>>> www.browseengine.com
>>> -John
>>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
I don't think this is an Open Source project: I couldn't find any
source on the site and the only download is a jar with .class files...

-glen

2008/12/10 John Wang <[EMAIL PROTECTED]>:
> www.browseengine.com
> -John
>
> On Wed, Dec 10, 2008 at 10:55 AM, Glen Newton <[EMAIL PROTECTED]> wrote:
>
>> From what I understand:
>> faceted browse is a taxonomy of depth =1
>>
>> A taxonomy in general has an arbitrary depth:
>>
>> Example: Biological taxonomy:
>>
>> Kingdom Animalia
>>   Phylum Acanthocephala
>>  Class Archiacanthocephala
>>   Phylum Annelida
>> Kingdom Fungi
>>   Phylum Ascomycota
>>  Class Ascomycetes
>> Order Acarosporales
>>Family Acarosporaceae
>>   Genus Acarospora
>>  Acarospora admissa
>>
>> -glen
>>
>>
>>
>> 2008/12/10 Karsten F. <[EMAIL PROTECTED]>:
>> >
>> > Hi Dipak,
>> >
>> > Which kind of "Taxonomy"?
>> > Where is the difference to "faceted browsing" in your case?
>> >
>> > best regards
>> >  Karsten
>> >
>> >
>> > Kesarkar, Dipak wrote:
>> >>
>> >> Hi
>> >>
>> >> I want to include Taxonomy feature in my search.
>> >>
>> >> Does Lucene support Taxonomy? How?
>> >>
>> >> If not, is there in different way to add Taxonomy feature in the Lucene
>> >> search?
>> >>
>> > --
>> > View this message in context:
>> http://www.nabble.com/Taxonomy-in-Lucene-tp20929487p20937717.html
>> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> > For additional commands, e-mail: [EMAIL PROTECTED]
>> >
>> >
>>
>>
>>
>> --
>>
>> -
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
From what I understand:
faceted browse is a taxonomy of depth =1

A taxonomy in general has an arbitrary depth:

Example: Biological taxonomy:

Kingdom Animalia
   Phylum Acanthocephala
  Class Archiacanthocephala
   Phylum Annelida
Kingdom Fungi
   Phylum Ascomycota
  Class Ascomycetes
 Order Acarosporales
Family Acarosporaceae
   Genus Acarospora
  Acarospora admissa

-glen



2008/12/10 Karsten F. <[EMAIL PROTECTED]>:
>
> Hi Dipak,
>
> Which kind of "Taxonomy"?
> Where is the difference to "faceted browsing" in your case?
>
> best regards
>  Karsten
>
>
> Kesarkar, Dipak wrote:
>>
>> Hi
>>
>> I want to include Taxonomy feature in my search.
>>
>> Does Lucene support Taxonomy? How?
>>
>> If not, is there in different way to add Taxonomy feature in the Lucene
>> search?
>>
> --
> View this message in context: 
> http://www.nabble.com/Taxonomy-in-Lucene-tp20929487p20937717.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NIOFSDirectory

2008-12-05 Thread Glen Newton
Understood. Thanks! :-)

-glen

2008/12/4 John Wang <[EMAIL PROTECTED]>:
> NIOFSDirectory.getDirectory simply calls the static method on the parent
> class: FSDirectory.getDirectory.
> Which returns an instance of FSDirectory.
>
> IMO: NIOFSDirectory solves concurrent read problems, generally you don't
> want concurrent writes.
>
> -John
>
> On Thu, Dec 4, 2008 at 2:44 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
>
>> Am I missing something here?
>>
>> Why not use:
>>  IndexWriter writer = new IndexWriter(NIOFSDirectory.getDirectory(new
>> File(filename)), analyzer, true);
>>
>> Another question: is NIOFSDirectory to be used with IndexWriter? If
>> no, could someone explain?
>>
>> thanks,
>> -glen
>>
>>
>> 2008/12/4 John Wang <[EMAIL PROTECTED]>:
>> > Thanks!
>> > -John
>> >
>> > On Thu, Dec 4, 2008 at 2:16 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> >
>> >> Details in the bug:
>> >> https://issues.apache.org/jira/browse/LUCENE-1451
>> >>
>> >> Use this constructor to create an instance of NIODirectory:
>> >>
>> >>  /** Create a new NIOFSDirectory for the named location.
>> >>   *
>> >>   * @param path the path of the directory
>> >>   * @param lockFactory the lock factory to use, or null for the default.
>> >>   * @throws IOException
>> >>   */
>> >>  public NIOFSDirectory(File path, LockFactory lockFactory) throws
>> >> IOException {
>> >>super(path, lockFactory);
>> >>  }
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> On Thu, Dec 4, 2008 at 5:08 PM, John Wang <[EMAIL PROTECTED]> wrote:
>> >> > That does not help. The File/path is not stored with the instance. It
>> is
>> >> in
>> >> > a map FSDirectory keeps statically. Should subclasses of FSDirectory
>> be
>> >> > modifying the map?
>> >> > This is not a question about how to subclass or customize FSDirectory.
>> >> This
>> >> > is more on how to use the NIOFSDirectory class. I am hoping for a simple
>> >> answer,
>> >> > is what I am doing (setting the class name statically on system
>> property)
>> >> > the right way?
>> >> >
>> >> > -John
>> >> >
>> >> > On Thu, Dec 4, 2008 at 2:00 PM, Yonik Seeley <[EMAIL PROTECTED]>
>> wrote:
>> >> >
>> >> >> On Thu, Dec 4, 2008 at 4:32 PM, Glen Newton <[EMAIL PROTECTED]>
>> >> wrote:
>> >> >> > Sorry... what version are we talking about?  :-)
>> >> >>
>> >> >> The current development version of Lucene allows you to directly
>> >> >> instantiate FSDirectory subclasses.
>> >> >>
>> >> >> -Yonik
>> >> >>
>> >> >>
>> >> >> > thanks,
>> >> >> >
>> >> >> > Glen
>> >> >> >
>> >> >> > 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>:
>> >> >> >> On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]>
>> >> wrote:
>> >> >> >>> Hi guys:
>> >> >> >>>We did some profiling and benchmarking:
>> >> >> >>>
>> >> >> >>>The thread contention on FSDIrectory is gone, and for the set
>> of
>> >> >> queries
>> >> >> >>> we are running, performance improved by a factor of 5 (to be
>> >> >> conservative).
>> >> >> >>>
>> >> >> >>>Great job, this is awesome, a simple change and made a huge
>> >> >> difference.
>> >> >> >>>
>> >> >> >>>To get NIOFSDirectory installed, I didn't find any
>> documentation
>> >> >> >>> (doesn't mean there aren't any), after reading the code, I
>> resorted
>> >> to:
>> >> >> >>>
>> >> >> >>>  static
>> >> >> >>>  {
>> >> >> >>>
>> >> >> >>>
>> >> >>
>> >>
>> System.setProperty("org.apache.lucene.FSDirectory.class",NIOFSDirectory.class.getName());
>> >>

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
Am I missing something here?

Why not use:
 IndexWriter writer = new IndexWriter(NIOFSDirectory.getDirectory(new
File(filename)), analyzer, true);

Another question: is NIOFSDirectory to be used with IndexWriter? If
no, could someone explain?

thanks, 
-glen
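The answer, per Yonik below, is to instantiate it directly. A sketch, with an
assumed index path:

  File path = new File("/var/indexes/articles");        // assumed location
  NIOFSDirectory dir = new NIOFSDirectory(path, null);  // null = default lock factory
  IndexReader reader = IndexReader.open(dir);
  IndexSearcher searcher = new IndexSearcher(reader);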


2008/12/4 John Wang <[EMAIL PROTECTED]>:
> Thanks!
> -John
>
> On Thu, Dec 4, 2008 at 2:16 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
>> Details in the bug:
>> https://issues.apache.org/jira/browse/LUCENE-1451
>>
>> Use this constructor to create an instance of NIODirectory:
>>
>>  /** Create a new NIOFSDirectory for the named location.
>>   *
>>   * @param path the path of the directory
>>   * @param lockFactory the lock factory to use, or null for the default.
>>   * @throws IOException
>>   */
>>  public NIOFSDirectory(File path, LockFactory lockFactory) throws
>> IOException {
>>super(path, lockFactory);
>>  }
>>
>> -Yonik
>>
>>
>> On Thu, Dec 4, 2008 at 5:08 PM, John Wang <[EMAIL PROTECTED]> wrote:
>> > That does not help. The File/path is not stored with the instance. It is
>> in
>> > a map FSDirectory keeps statically. Should subclasses of FSDirectory be
>> > modifying the map?
>> > This is not a question about how to subclass or customize FSDirectory.
>> This
>> > is more on how to use the NIOFSDirectory class. I am hoping for a simple
>> answer,
>> > is what I am doing (setting the class name statically on system property)
>> > the right way?
>> >
>> > -John
>> >
>> > On Thu, Dec 4, 2008 at 2:00 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> >
>> >> On Thu, Dec 4, 2008 at 4:32 PM, Glen Newton <[EMAIL PROTECTED]>
>> wrote:
>> >> > Sorry... what version are we talking about?  :-)
>> >>
>> >> The current development version of Lucene allows you to directly
>> >> instantiate FSDirectory subclasses.
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> > thanks,
>> >> >
>> >> > Glen
>> >> >
>> >> > 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>:
>> >> >> On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]>
>> wrote:
>> >> >>> Hi guys:
>> >> >>>We did some profiling and benchmarking:
>> >> >>>
>> >> >>>The thread contention on FSDIrectory is gone, and for the set of
>> >> queries
>> >> >>> we are running, performance improved by a factor of 5 (to be
>> >> conservative).
>> >> >>>
>> >> >>>Great job, this is awesome, a simple change and made a huge
>> >> difference.
>> >> >>>
>> >> >>>To get NIOFSDirectory installed, I didn't find any documentation
>> >> >>> (doesn't mean there aren't any), after reading the code, I resorted
>> to:
>> >> >>>
>> >> >>>  static
>> >> >>>  {
>> >> >>>
>> >> >>>
>> >>
>> System.setProperty("org.apache.lucene.FSDirectory.class",NIOFSDirectory.class.getName());
>> >> >>>  }
>> >> >>>   I am sure this is not the intended usage, as this is really ugly.
>> >> What is
>> >> >>> the suggested usage?
>> >> >>
>> >> >> Instantiate NIOFSDirectory directly and pass it to the
>> >> IndexReader.open()
>> >> >>
>> >> >> -Yonik
>> >> >>
>> >> >> -
>> >> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> >
>> >> > -
>> >> >
>> >> > -
>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>> >> >
>> >> >
>> >>
>> >> -
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
Sorry... what version are we talking about?  :-)

thanks,

Glen

2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>:
> On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote:
>> Hi guys:
>>We did some profiling and benchmarking:
>>
>>The thread contention on FSDIrectory is gone, and for the set of queries
>> we are running, performance improved by a factor of 5 (to be conservative).
>>
>>Great job, this is awesome, a simple change and made a huge difference.
>>
>>To get NIOFSDirectory installed, I didn't find any documentation
>> (doesn't mean there aren't any), after reading the code, I resorted to:
>>
>>  static
>>  {
>>
>> System.setProperty("org.apache.lucene.FSDirectory.class",NIOFSDirectory.class.getName());
>>  }
>>   I am sure this is not the intended usage, as this is really ugly. What is
>> the suggested usage?
>
> Instantiate NIOFSDirectory directly and pass it to the IndexReader.open()
>
> -Yonik
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene nicking my memory ?

2008-12-03 Thread Glen Newton
Hi Magnus,

Could you post the OS, version, RAM size, swapsize, Java VM version,
hardware, #cores, VM command line parameters, etc? This can be very
relevant.

Have you tried other garbage collectors and/or tuning as described in
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html?

2008/12/3 Magnus Rundberget <[EMAIL PROTECTED]>:
> Hi,
>
> We have an application using Tomcat, Spring etc and Lucene 2.4.0.
> Our index is about 100MB (in test) and has about 20 indexed fields.
>
> Performance is pretty good, but we are experiencing very high memory
> usage when searching.
>
> Looking at JConsole during a somewhat silly scenario (but illustrates the
> problem);
> (Allocated 512 MB Min heap space, max 1024)
>
> 0. Initially memory usage is about 70MB
> 1. Search for word "er", heap memory usage goes up by 100-150MB
> 1.1 Wait for 30 seconds... memory usage stays the same (i.e. no GC)
> 2. Search by word "og", heap memory usage goes up another 50-100MB
> 2.1 See 1.1
>
> ...and so on until it seems to reach the 512 MB limit, and then a garbage
> collection is performed
> i.e. garbage collection doesn't seem to occur until it "hits the roof"
>
> We believe the scenario is similar in production, where our heap space is
> limited to 1.5 GB.
>
>
> Our search is basically as follows
> --
> 1. Open an IndexSearcher
> 2. Build a Boolean Query searching across 4 fields (title, summary, content
> and daterangestring MMDD)
> 2.1 Sort on title
> 3. Perform search
> 4. Iterate over hits to build a set of custom result objects (pretty small,
> as we dont include content in these)
> 5. Close searcher
> 6. Return result objects.

You should not close the searcher: it can be shared by all queries.
What happens when you warm Lucene with a (large) number of queries: do
things stabilize over time?
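A shared-searcher sketch (the path is illustrative, and it assumes the index
is not reopened while serving queries):

  // opened once at startup; IndexSearcher handles concurrent queries,
  // so all request threads can share this single instance
  private static final IndexSearcher SEARCHER;
  static {
    try {
      SEARCHER = new IndexSearcher(
          FSDirectory.getDirectory(new File("/path/to/index")));
    } catch (IOException e) {
      throw new ExceptionInInitializerError(e);
    }
  }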

A 100MB index is (relatively) very small for Lucene (I have indexes
>100GB). What kind of response times are you getting, independent of
memory usage?

-glen

>
> We have tried various options based on entries on this mailing list;
> a) Cache the IndexSearcher - Same results
> b) Remove sorting - Same result
> c) In point 4 only iterating over a limited amount of hits rather than whole
> collection - Same result in terms of memory usage, but obviously increased
> performance
> d) Using RamDirectory vs FSDirectory - Same result only initial heap usage
> is higher using ramdirectory (in conjuction with cached indexsearcher)
>
>
> Doing some profiling using YourKit shows a huge number of char[], int[] and
> String[], and an ever-increasing number of Lucene-related objects.
>
>
>
> Reading through the mailing lists, suspicions are that our problem is
> related to ThreadLocals and memory not being released. Noticed that there
> was a related patch for this in 2.4.0, but it doesn't seem to help us much.
>
> Any ideas ?
>
> kind regards
> Magnus
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Merging indexes & multicore/multithreading

2008-12-02 Thread Glen Newton
Let's say I have 8 indexes on a 4-core system and I want to merge them
(inside a single VM instance).
Is it better to do a single merge of all 8, or to merge them in pairs in
parallel threads until only a single index is left? I guess the question
comes down to how multi-threaded merging is, and whether it will take
advantage of all cores.
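For reference, the single-merge variant looks like this (paths are
placeholders):

  Directory[] dirs = new Directory[8];
  for (int i = 0; i < 8; i++)
    dirs[i] = FSDirectory.getDirectory(new File("/indexes/part" + i));

  Directory mergedDir = FSDirectory.getDirectory(new File("/indexes/merged"));
  IndexWriter writer = new IndexWriter(mergedDir, new StandardAnalyzer(), true);
  writer.addIndexesNoOptimize(dirs);  // copies and merges the 8 source indexes
  writer.optimize();                  // merges run on ConcurrentMergeScheduler threads
  writer.close();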

thanks,

-glen

-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene 2.3.1 vs 2.4 benchmarks using LuSql

2008-11-24 Thread Glen Newton
I have some simple indexing benchmarks comparing Lucene 2.3.1 with 2.4:
 http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html

In the next couple of days I will be running benchmarks comparing
Solr's DataImportHandler/JdbcDataSource indexing performance with
LuSql and will release them ASAP.

thanks,

Glen

PS. Previous Lucene benchmarks:
- http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
- http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
- http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html

-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Software Announcement: LuSql: Database to Lucene indexing

2008-11-17 Thread Glen Newton
LuSql is a simple but powerful tool for building Lucene indexes from
relational databases. It is a command-line Java application for the
construction of a Lucene index from an arbitrary SQL query of a
JDBC-accessible SQL database. It allows a user to control a number of
parameters, including the SQL query to use, individual
indexing/storage/term-vector nature of fields, analyzer, stop word
list, and other tuning parameters. In its default mode it uses
threading to take advantage of multiple cores.

LuSql can handle complex queries, allows for additional per-record
sub-queries, and has a plug-in architecture for arbitrary Lucene
document manipulation. Its only dependencies are three Apache Commons
libraries, the Lucene core itself, and a JDBC driver.

LuSql has been extensively tested, including a large 6+ million
full-text & metadata journal article document collection, producing an
86GB Lucene index in ~13 hours.

http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Glen Newton

-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: "Global" Field question (thread-safe)?

2008-11-06 Thread Glen Newton
Thanks!  :-)

2008/11/6 Michael McCandless <[EMAIL PROTECTED]>:
>
> The field never changes across all docs?  If so, this will work fine.
>
> Mike
>
> Glen Newton wrote:
>
>> I have a use case where I want all of my documents to have - in
>> addition to their other fields - a single field=value.
>> An example use is where I have multiple Lucene indexes that I search
>> in parallel, but still need to distinguish them.
>> Index 1: All documents have: source="a1"
>> Index 2: All documents have: source="a2"
>>
>> This is a common use case that has previously been discussed on this list.
>>
>> The particular question I have is: when I am indexing, can I create a
>> single Field and use it for all Documents? Note I am in a
>> multithreaded environment, so many Documents are created and will have
>> this same Field added to them, and subsequently indexed.
>>
>> So are there any threading issues with this particular usage?
>>
>> thanks,
>>
>> Glen
>>
>> --
>>
>> -
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



"Global" Field question (thread-safe)?

2008-11-06 Thread Glen Newton
I have a use case where I want all of my documents to have - in
addition to their other fields - a single field=value.
An example use is where I have multiple Lucene indexes that I search
in parallel, but still need to distinguish them.
Index 1: All documents have: source="a1"
Index 2: All documents have: source="a2"

This is a common use case that has previously been discussed on this list.

The particular question I have is: when I am indexing, can I create a
single Field and use it for all Documents? Note I am in a
multithreaded environment, so many Documents are created and will have
this same Field added to them, and subsequently indexed.

So are there any threading issues with this particular usage?

thanks,

Glen
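As answered in the reply above, this works as long as the Field's value never
changes. A sketch of the pattern, with illustrative field names:

  // one shared, never-modified Field instance, created once
  final Field source = new Field("source", "a1",
      Field.Store.YES, Field.Index.UN_TOKENIZED);

  // called concurrently by the indexing threads
  void addDoc(IndexWriter writer, String text) throws IOException {
    Document doc = new Document();   // per-call, so no sharing issues here
    doc.add(source);                 // same Field instance added every time
    doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
  }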

-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Document thread safe?

2008-10-31 Thread Glen Newton
Yes, the problem goes away when I do the following:
 synchronized(doc)
{
   doc.add(field);
}

Thanks.

[I'll use a Lock to do this properly]
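With java.util.concurrent.locks.ReentrantLock that looks roughly like this
(a sketch):

  private final ReentrantLock docLock = new ReentrantLock();

  void addField(Document doc, Field field) {
    docLock.lock();
    try {
      doc.add(field);
    } finally {
      docLock.unlock();   // always released, even if add() throws
    }
  }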

-glen

2008/10/31 Yonik Seeley <[EMAIL PROTECTED]>:
> On Fri, Oct 31, 2008 at 11:53 AM, Glen Newton <[EMAIL PROTECTED]> wrote:
>> I have concurrent threads adding Fields to the same Document, but
>> getting some odd behaviour.
>> Before going into too much depth, is Document thread-safe?
>
> No, it's not.
> synchronizing on Document when adding a new field would probably be
> the easiest fix for you.
>
> -Yonik
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Document thread safe?

2008-10-31 Thread Glen Newton
Hello,

I am using Lucene 2.3.1.

I have concurrent threads adding Fields to the same Document, but
getting some odd behaviour.
Before going into too much depth, is Document thread-safe?

thanks,

Glen

http://zzzoot.blogspot.com/

-- 

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Michael McCandless <[EMAIL PROTECTED]>:
>
> Mark Miller wrote:
>
>> Glen Newton wrote:
>>>
>>> 2008/10/23 Mark Miller <[EMAIL PROTECTED]>:
>>>
>>>> It sounds like you might have some thread synchronization issues outside
>>>> of
>>>> Lucene. To simplify things a bit, you might try just using one
>>>> IndexWriter.
>>>> If I remember right, the IndexWriter is now pretty efficient, and there
>>>> isn't much need to index to smaller indexes and then merge. There is a
>>>> lot
>>>> of juggling to get wrong with that approach.
>>>>
>>>
>>> While I agree it is easier to have a single IndexWriter, if you have
>>> multiple cores you will get significant speed-ups with multiple
>>> IndexWriters, even with the impact of merging at the end.
>>> #IndexWriters = # physical cores is a reasonable rule of thumb.
>>>
>>> General speed-up estimate: # cores * 0.6 - 0.8  over single IndexWriter
>>> YMMV
>>>
>>> When I get around to it, I'll re-run my tests varying the # of
>>> IndexWriters & post.
>>>
>>> -Glen
>>>
>> Hey Mr McCandless, whats up with that? Can IndexWriter be made to be as
>> efficient as using Multiple Writers? Where do you suppose the hold up is?
>> Number of threads doing merges? Sync contention? I hate the idea of multiple
>> IndexWriter/Readers being more efficient than a single instance. In an ideal
>> Lucene world, a single instance would hide the complexity and use the number
>> of threads needed to match multiple instance performance.
>
> Honestly this surprises me: I would expect a single IndexWriter with
> multiple threads to be as fast as (or faster than, considering the extra
> merge time at the end) multiple IndexWriters.
>
> IndexWriter's concurrency has improved a lot lately, with
> ConcurrentMergeScheduler.  The only serious operation that is not concurrent
> is flushing the RAM buffer as a new segment; but in a well tuned indexing
> process (large RAM buffer) the time spent there should be quite small,
> especially with a fast IO system.
>
> Actually, addIndexes is also not concurrent in that if multiple threads call
> it, only one can run at once.  But normally you would call it with all the
> indices you want to add, and then the merging is concurrent.
>
> Glen, in your single IndexWriter test, is it possible there was accidental
> thread contention during document preparation or analysis?

I don't think there was. I've been refining this for quite a while, and
have done a lot of analysis and hand-checking of the threading stuff.

I do use multiple threads for document creation: this is where much of
the speed-up happens (at least in my case where I have a large indexed
field for the full-text of an article: the parsing becomes a
significant part of the process).

> I do agree that we should strive to have enough concurrency in IndexWriter
> and IndexReader so that you don't get any real benefit by using separate
> instances. Eg in 2.4.0 you can now open read-only IndexReaders, and on Unix
> you can use NIOFSDirectory, both of which should go a long ways towards
> fixing IndexReader's concurrency issue.

My original tests were run in the spring with 2.3.1. I am planning on
doing the new tests with 2.4 for indexing, as well as re-doing my
concurrent query tests[1] and concurrent multiple reader tests[2]
using the features you describe. I am sure the results will be quite
different...

BTW the files I am indexing were originally PDFs, but were batch
converted to text and stored compressed on the filesystem, so except
for GUnzipping them there is no other overhead.

[1]http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
[2]http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html

-glen

> Mike
>






Re: Multi-threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Mark Miller <[EMAIL PROTECTED]>:
> It sounds like you might have some thread synchronization issues outside of
> Lucene. To simplify things a bit, you might try just using one IndexWriter.
> If I remember right, the IndexWriter is now pretty efficient, and there
> isn't much need to index to smaller indexes and then merge. There is a lot
> of juggling to get wrong with that approach.

While I agree it is easier to have a single IndexWriter, if you have
multiple cores you will get significant speed-ups with multiple
IndexWriters, even with the impact of merging at the end.
#IndexWriters = # physical cores is a reasonable rule of thumb.

General speed-up estimate: # cores * 0.6 - 0.8  over single IndexWriter
YMMV

When I get around to it, I'll re-run my tests varying the # of
IndexWriters & post.

-Glen

>
> - Mark
>
> Sudarsan, Sithu D. wrote:
>>
>> Hi,
>>
>> We are trying to index a large collection of PDF documents, sizes varying
>> from few KB to few GB.  Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
>> text extraction) and on Windows as well as CentOS Linux. Used java -Xms
>> and -Xmx options, both at 1080m, even though we have 4GB on Windows and
>> 32 GB on Linux with sufficient swap space.
>>
>> With just one thread, though it takes time, the indexing happens. To
>> speed up, we tried multi-threaded approach with one Indexwriter for each
>> thread. After all the threads finish their indexing, they are merged.
>> With about 100 sample files and 10 threads, the program works pretty
>> well and it does speed up. But when we run on a document collection of
>> about 25GB, a couple of threads just hang, while the rest have completed
>> their indexing. The program never gracefully exits, and the threads that
>> seem to have died ensure that the final index merging does not take
>> place. The program needs to be manually terminated.
>> Tried both with simple analyzer as well as standard analyzer, with
>> similar results.
>>
>> Any useful tips / solutions welcome.
>>
>> Thanks in advance,
>> Sithu Sudarsan
>> Graduate Research Assistant, UALR
>> & Visiting Researcher, CDRH/OSEL
>>
>> [EMAIL PROTECTED]
>> [EMAIL PROTECTED]
>>
>>
>>
>
>






Re: Multi-threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
You might want to look at my indexing of 6.4 million PDF articles,
full-text and metadata. It resulted in an 83GB index and took 20.5 hours
to run. It uses multiple writers and is massively multithreaded.

More info here:
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
Check-out the notes at the bottom for details.

In order to make threading/queues much easier and more robust, you
want to use: java.util.concurrent.ThreadPoolExecutor
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html

Even with these, I've also had problems like you describe. One thing
I've found is that you need to shut the ThreadPoolExecutor down
correctly, something like:

threadPoolExecutor.shutdown();
while (!threadPoolExecutor.isTerminated()) {
    try {
        Thread.sleep(ShutdownDelay);
    } catch (InterruptedException ie) {
        System.out.println(" interrupted");
    }
}
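
For completeness, a sketch of the same shutdown expressed with
awaitTermination, which blocks without a hand-rolled sleep loop (the
one-hour timeout here is an arbitrary illustrative value):

import java.util.concurrent.TimeUnit;

threadPoolExecutor.shutdown();               // stop accepting new tasks
try {
    // block until all queued tasks finish, or give up after the timeout
    if (!threadPoolExecutor.awaitTermination(1, TimeUnit.HOURS)) {
        threadPoolExecutor.shutdownNow();    // interrupt any stragglers
    }
} catch (InterruptedException ie) {
    threadPoolExecutor.shutdownNow();
    Thread.currentThread().interrupt();
}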

You also need to simplify your threading so as to reduce the
possibility of deadlock.

I hope this is useful.

-Glen

2008/10/23 Sudarsan, Sithu D. <[EMAIL PROTECTED]>:
>
> Hi,
>
> We are trying to index a large collection of PDF documents, sizes varying
> from few KB to few GB.  Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
> text extraction) and on Windows as well as CentOS Linux. Used java -Xms
> and -Xmx options, both at 1080m, even though we have 4GB on Windows and
> 32 GB on Linux with sufficient swap space.
>
> With just one thread, though it takes time, the indexing happens. To
> speed up, we tried multi-threaded approach with one Indexwriter for each
> thread. After all the threads finish their indexing, they are merged.
> With about 100 sample files and 10 threads, the program works pretty
> well and it does speed up. But when we run on a document collection of
> about 25GB, a couple of threads just hang, while the rest have completed
> their indexing. The program never gracefully exits, and the threads that
> seem to have died ensure that the final index merging does not take
> place. The program needs to be manually terminated.
>
> Tried both with simple analyzer as well as standard analyzer, with
> similar results.
>
> Any useful tips / solutions welcome.
>
> Thanks in advance,
> Sithu Sudarsan
> Graduate Research Assistant, UALR
> & Visiting Researcher, CDRH/OSEL
>
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
>
>






Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
See also:
http://zzzoot.blogspot.com/2007/10/drill-clouds-for-search-refinement-id.html
 and
http://zzzoot.blogspot.com/2007/10/tag-cloud-inspired-html-select-lists.html

-glen

2008/10/16 Glen Newton <[EMAIL PROTECTED]>:
> Yes, tag clouds.
>
> I've implemented them using Lucene here for NRC Research Press articles:
> http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?tagCloud=true&collection=jos&tagField=keyword&keyword=%22chromatin%22&numCloudDocs=200&numCloudTags=50&sortBy=relevance
>
> and here on the Colorado State University Libraries' Catalog:
> http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava01/Search?tagCloud=true&collection=csu&tagField=keyword&title=cell&numCloudDocs=200&numCloudTags=50&sortBy=relevance
>
> As I use them for query refinement (click on the term & it is appended
> to your existing query & you get new results), I call them "drill
> clouds": 
> http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Drill_Clouds#Drill_Clouds
>
> -glen
>
> 2008/10/16 Darren Govoni <[EMAIL PROTECTED]>:
>> I guess a link map (as I understand it) is a collection of hyperlinks of
>> words/phrases where the dominant ones are in a bolder color and larger font.
>> It's a relatively new schema that some sites are using.
>>
>> For example, someone searches for a person and a link map would show
>> them all the most frequent terms in the results they got back. Sort of
>> like latent relationships.
>>
>> Does that help?
>>
>> I thought this could be done using term frequency vectors in Lucene, but
>> I've never used TFV's before. And can then be limited to just a set of
>> results.
>>
>> HTH,
>> Darren
>>
>> On Thu, 2008-10-16 at 14:09 -0400, Glen Newton wrote:
>>> Sorry, could you explain what you mean by a "link map over lucene results"?
>>>
>>> thanks,
>>> -glen
>>>
>>> 2008/10/16 Darren Govoni <[EMAIL PROTECTED]>:
>>> > Hi,
>>> >  Has anyone created a link map over lucene results or know of a link
>>> > describing the process? If not, I would like to build one to contribute.
>>> >
>>> > Also, I read about term frequencies in the book, but wanted to know if I
>>> > can extract the strongest occurring terms from a given result set or
>>> > result?
>>> >
>>> > thank you for any help. I will keep reading/looking.
>>> >
>>> > Darren
>>> >
>>> >






Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
Yes, tag clouds.

I've implemented them using Lucene here for NRC Research Press articles:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?tagCloud=true&collection=jos&tagField=keyword&keyword=%22chromatin%22&numCloudDocs=200&numCloudTags=50&sortBy=relevance

and here on the Colorado State University Libraries' Catalog:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava01/Search?tagCloud=true&collection=csu&tagField=keyword&title=cell&numCloudDocs=200&numCloudTags=50&sortBy=relevance

As I use them for query refinement (click on the term & it is appended
to your existing query & you get new results), I call them "drill
clouds": 
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Drill_Clouds#Drill_Clouds
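
A rough sketch (assuming the keyword field was indexed with term vectors
enabled, and that reader and topHitDocIds are in scope) of collecting the
dominant terms over a hit set via term frequency vectors, which is
essentially what feeds such a cloud:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

Map<String, Integer> counts = new HashMap<String, Integer>();
for (int docId : topHitDocIds) {                      // e.g. the top 200 hits
    TermFreqVector tfv = reader.getTermFreqVector(docId, "keyword");
    if (tfv == null) continue;                        // no term vector stored
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        Integer old = counts.get(terms[i]);
        counts.put(terms[i], old == null ? freqs[i] : old + freqs[i]);
    }
}
// sort counts by value and keep the top numCloudTags terms for the cloud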

-glen

2008/10/16 Darren Govoni <[EMAIL PROTECTED]>:
> I guess a link map (as I understand it) is a collection of hyperlinks of
> words/phrases where the dominant ones are in a bolder color and larger font.
> It's a relatively new schema that some sites are using.
>
> For example, someone searches for a person and a link map would show
> them all the most frequent terms in the results they got back. Sort of
> like latent relationships.
>
> Does that help?
>
> I thought this could be done using term frequency vectors in Lucene, but
> I've never used TFV's before. And can then be limited to just a set of
> results.
>
> HTH,
> Darren
>
> On Thu, 2008-10-16 at 14:09 -0400, Glen Newton wrote:
>> Sorry, could you explain what you mean by a "link map over lucene results"?
>>
>> thanks,
>> -glen
>>
>> 2008/10/16 Darren Govoni <[EMAIL PROTECTED]>:
>> > Hi,
>> >  Has anyone created a link map over lucene results or know of a link
>> > describing the process? If not, I would like to build one to contribute.
>> >
>> > Also, I read about term frequencies in the book, but wanted to know if I
>> > can extract the strongest occurring terms from a given result set or
>> > result?
>> >
>> > thank you for any help. I will keep reading/looking.
>> >
>> > Darren
>> >
>> >






Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
Sorry, could you explain what you mean by a "link map over lucene results"?

thanks,
-glen

2008/10/16 Darren Govoni <[EMAIL PROTECTED]>:
> Hi,
>  Has anyone created a link map over lucene results or know of a link
> describing the process? If not, I would like to build one to contribute.
>
> Also, I read about term frequencies in the book, but wanted to know if I
> can extract the strongest occurring terms from a given result set or
> result?
>
> thank you for any help. I will keep reading/looking.
>
> Darren
>
>






Re: Indexing Scalability, Multiwriter?

2008-10-10 Thread Glen Newton
IndexWriter is thread-safe and has been for a while
(http://www.mail-archive.com/[EMAIL PROTECTED]/msg00157.html)
so you don't have to worry about that.

As reported in my blog in April
(http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html)
but perhaps not explicitly enough: in indexing 6.4M full-text articles
generating an index of 83GB, I used a pipeline architecture consisting
of several ThreadPoolExecutors:

1 - A main program that gets the article metadata (author, title,
abstract, etc) from JDBC + creates Article object + adds it to #2
queue;

2 - A pool with a queue of 100 Article objects; the Runnable reads the
full-text for the article from the file system. The files are GZipped,
so they are decompressed here as well. The full-text is added to the
Article object & the Article object added to queue #3. 4 threads (as more
cause major performance degradation through IO waits).

3 - A pool with a queue of 1000 Article objects; the Runnable creates
a Lucene Document from the Article object fields and adds the Document
to queue #4. 64 threads are running in this pool.

4 - A pool with a queue of 100 Documents; the Runnable adds the
Document to one of 8 IndexWriters, chosen round-robin. 16 threads
running in this pool.

When all documents are processed, all 8 IndexWriters are merged into a
single index and optimized. From the blog entry: 20.5 hours to process
6.4M articles, 143GB text. See the entry for software/VM/hardware
details.

I tried all combinations of threads/pool size/#IndexWriters and the
above was the 'sweet point' for my particular index and hardware.
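
As a compressed sketch of stage 4 (class and variable names invented here,
not the actual code): a ThreadPoolExecutor draining a bounded queue and
spreading addDocument calls round-robin over several IndexWriters. The
bounded queue plus CallerRunsPolicy is one way to get back-pressure between
pipeline stages; when everything drains, the writers' indexes are merged
with addIndexes and optimized, as described above.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class RoundRobinIndexStage {
    private final IndexWriter[] writers;          // e.g. 8, merged at the end
    private final AtomicLong counter = new AtomicLong();
    private final ExecutorService pool;

    public RoundRobinIndexStage(IndexWriter[] writers, int threads, int queueSize) {
        this.writers = writers;
        // bounded queue; CallerRunsPolicy slows the producer when the queue is full
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public void submit(final Document doc) {
        pool.execute(new Runnable() {
            public void run() {
                try {
                    // IndexWriter is thread-safe; pick the next writer round-robin
                    int i = (int) (counter.getAndIncrement() % writers.length);
                    writers[i].addDocument(doc);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}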

I hope this is helpful. If you have any questions, please let me know.

Related:
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-searcher-performance.html

-Glen



2008/10/10 Darren Govoni <[EMAIL PROTECTED]>:
> Hi gang,
>  Wondering how folks have addressed scaled-up indexing. I saw old threads
> about using clustered webapp with JNDI singleton index writer due to the
> Lucene single writer limitation. Is this limitation lifted in 3 maybe?
> Is there a best strategy for parallel writing to an index by many
> threads?
>
> thanks for any tips! You guys rock.
> Darren
>
>






Re: could I implement this scenario?

2008-09-19 Thread Glen Newton
> I think it is not a good idea to use Lucene as storage; it is just an index.

I strongly disagree with this position.

To qualify my disagreement: yes, you should not use Lucene as your
primary storage for your data in your organization.

But, for a particular application, taking content from your primary
storage system (SQL database, filesystem files, etc) and - in the
context of an end-user application - both indexing and storing the
content is a good solution. The stored content in Lucene is
effectively a cache.

Advantages:
- faster (don't have to make additional queries to find content in
primary storage system)
- less system dependencies (if the primary system is down...)
- no longer hitting primary storage system (which are usually already
busy doing other things and also tend to be expensive)
- simpler

Disadvantages
- larger index
- might be slower, if the index is significantly larger
- updating issues
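
A minimal snippet of the pattern (Lucene 2.x-era API; articleId and
articleText are assumed variables): the content is indexed for search and
also stored, compressed, in the same index, so a hit can be rendered
without a call back to the primary store:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("id", articleId, Field.Store.YES, Field.Index.UN_TOKENIZED));
// indexed for search AND stored (compressed) alongside the index terms
doc.add(new Field("text", articleText, Field.Store.COMPRESS, Field.Index.TOKENIZED));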

thanks,

Glen
http://zzzoot.blogspot.com/

2008/9/19 Dragan Jotanovic <[EMAIL PROTECTED]>:
> I think it is not a good idea to use Lucene as storage; it is just an index.
> You could probably implement this using flat files and lucene.
> Your simDocId would be a stored field which you can retrieve from the index 
> after search, and it could also contain the information about where on disk 
> the document is located.
>
>
> -Original Message-
> From: xh sun [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 19, 2008 9:44 AM
> To: java-user@lucene.apache.org
> Subject: Re: could I implement this scenario?
>
> I store the data in flatfiles and db. I want to implement it using Lucene 
> only, but if it fails, maybe I shall create a temporary table for each query.
>
>
>
> - Original Message 
> From: mathieu <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, September 19, 2008 4:34:13 PM
> Subject: Re: could I implement this scenario?
>
>
> Lucene is just an index. Where do you want to store your data? In a db,
> flatfiles, a document with an url, in Lucene?
>
> M.
>
> On Fri, 19 Sep 2008 16:25:27 +0800 (CST), xh sun
> <[EMAIL PROTECTED]> wrote:
>> Thank you. Mathieu.
>>
>> But the hits don't include the document doc02 in my example; how do I
>> display doc02? I don't want to search by docid. Thanks.
>>
>>
>>
>> - Original Message 
>> From: mathieu <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Friday, September 19, 2008 4:14:34 PM
>> Subject: Re: could I implement this scenario?
>>
>>
>>
>> Yes. You can store data in the Lucene index and not search on it: your
>> simdocid.
>>
>> M.
>>
>> On Fri, 19 Sep 2008 16:00:20 +0800 (CST), xh sun
>> <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>>
>>> How can I implement this scenario in Lucene?
>>>
>>> suppose every document has three fields: docid, doctext and simdocid.
>>> docid is the id of the document, doctext is the content of the document,
>>> simdocid is the docid of a document similar to this one.
>>> example:
>>> docid   doctext   simdocid
>>> doc01             doc04
>>> doc02             doc03
>>> doc03             doc02
>>> doc04             doc03
>>> doc05             doc04
>>> doc06             doc02
>>>
>>> During query, the index will be searched based on the field doctext. If the
>>> hits include four documents doc01,doc03,doc04, doc05, I want to display
>> the
>>> corresponding similar documents only, that is, the three documents
>>> doc04,doc02,doc03.
>>>
>>> Appreciate your help very much.
>>>
>>> BR,
>>> Shawn






Re: Tree search

2008-08-07 Thread Glen Newton
There are a number of ways to do this. Here is one:
Lose the parentid field (unless you have other reasons to keep it).
Add a field fullName and a field called depth:

doc1
fullName: /state
depth: 0

doc2
fullName: /state/department
depth: 1

doc3
fullName: /state/department/Boston
depth: 2

doc4
fullName: /state/department/Opera
depth: 2

doc5
fullName: /state/Chicago
depth: 1

doc6
fullName: /state/department/Opera/November
depth: 3

> 1. Same path, for example: /state/department/Boston – return doc3
query: fullName:+/state/department/Boston

> 2. Direct children of the path, for example: /state/department – return doc3,doc4
query: fullName:+/state/department/ depth:+"2"

> 3. All descendants of the path, for example: /state/department – return
> doc3,doc4,doc6
query: fullName:+/state/department/

Is this what you need?
Depending on your use cases, there may be better ways of implementing this.

As this is not a relational db, we are not concerned (hopefully) with
the replicated information in the fullName field.

thanks,

Glen

2008/8/7 Sergey Kabashnyuk <[EMAIL PROTECTED]>:
> Hello
> I have the following document structure:
> doc1
> id - 1
> parentid - 0
> name - state
> doc2
> id - 2
> parentid - 1
> name - department
> doc3
> id - 3
> parentid - 2
> name - Boston
> doc4
> id - 4
> parentid - 2
> name - Opera
> doc5
> id - 5
> parentid - 1
> name - Chicago
> doc6
> id - 6
> parentid - 4
> name - November
>
> All documents are linked by parentid = the id of the parent document.
> Through this link the full path of a document can be retrieved,
> for example:
> doc3 - /state/department/Boston
> doc5 - /state/Chicago
>
>
> I want to implement search by path:
> 1. Same path, for example: /state/department/Boston – return doc3
> 2. Direct children of the path, for example: /state/department – return doc3,doc4
> 3. All descendants of the path, for example: /state/department – return
> doc3,doc4,doc6
>
> I need advice on the best way to implement this.
>
> Moving is a very frequent operation, and storing the full path would incur
> additional unwanted operations, so it's not a desirable solution.
>
> Sergey Kabashnyuk
> eXo Platform SAS
>





Re: Scaling

2008-07-16 Thread Glen Newton
A subset of your questions are answered (or at least examined) in my
postings on multi-thread queries on a multiple-core single system:
http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html

-Glen

2008/7/16 Karl Wettin <[EMAIL PROTECTED]>:
> Is there some sort of listing of scaling strategies available? I think there
> is a Wiki page missing.
>
> What are the typical problems I'll encounter when distributing the search
> over multiple machines?
>
> Do people split up their index per node or do they use the complete index
> and restrict what part to search in using filters? The latter would be good
> for the scores, right? Then how do I calculate the cost in speed for the
> score with better quality? I mean, splitting the index in two and searching
> on two machines using ParallelMultiSearcher probably means that I'll get
> something like 30% speed improvement and not 100%. Or?
>
> Is there something to win by using multiple threads each restricted to a
> part each of the same index on a single machine, compared to a single
> thread? Or is it all I/O? That would mean there is something to gain if the
> index was on SSD or in RAM, right?
>
>
>  karl
>






Re: How to make documents clustering and topic classification with lucene

2008-07-08 Thread Glen Newton
Use Carrot2:
 http://project.carrot2.org/

For Lucene + Carrot2:
 http://project.carrot2.org/faq.html#lucene-integration

-glen

2008/7/7 Ariel <[EMAIL PROTECTED]>:
> Hi everybody:
> Do you have any idea how to do document clustering and topic
> classification using Lucene? Is there any way to do this?
> Please, I need help.
> Thanks everybody.
> Ariel
>






Re: Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-13 Thread Glen Newton
Lutan,

Yes, no problem. I am away at a conference next week but plan to
release the code the following week. Is this OK for you?

thanks,

Glen

2008/6/13 lutan <[EMAIL PROTECTED]>:
>
> TO: Glen Newton Could I get your test code or code architecture for study.
> I have try to using java.util.concurrent package(
> like ArrayBlockingQueue  ThreadPoolExecutor;)
>  with lucene,but it is no successful.I don't
> know how to design.
>
>
> Thanks! my email: [EMAIL PROTECTED]





Re: Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-11 Thread Glen Newton
Hi Otis,

Thanks for the feedback.

2008/6/11 Otis Gospodnetic <[EMAIL PROTECTED]>:
> Hi Glen,
>
> Aha, good to see the benefit of multiple IndexReaders/Searchers so clearly.  
> Makes me think we'll want to add a config setting for this in Solr... :)

Until then, you might want to use: Runtime.availableProcessors()
as the default #.
Oh no, that won't work: it gives me 8 (the number of hyperthreaded
processors) versus 4 (# of real cores). Hmm, I consider not being able
to find the number of physical cores as being, well, a bug (I guess
you could turn off hyperthreading). Anyone know if there is a JSR
looking for perhaps: Runtime.availableRealProcessors()?
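
For reference, the call under discussion is an instance method reached via
Runtime.getRuntime(); it reports logical processors, which is why
hyperthreading inflates the count:

// returns logical CPUs visible to the JVM, e.g. 8 on a hyperthreaded 4-core box
int logicalCores = Runtime.getRuntime().availableProcessors();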

> As for why 4 is the best choice, I think it's because of those 4 cores that 
> you've got.  My guess is that you'll see slightly better performance with 5 
> threads and then the performance will slowly deteriorate with more 
> readers/searchers; let's see it!

I'm running it & will post when it is done.

thanks,

Glen  :-)

>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
>> From: Glen Newton <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, June 11, 2008 2:07:45 PM
>> Subject: Concurrent query benchmarks, with 1,2,4,8 readers
>>
>> I have extended my evaluation (previous evaluation:
>> http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html)
>> to include as well as an increasing # of threads performing concurrent
>> queries, 1,2,4 and 8 IndexReaders.
>>
>> The results can be found here:
>> http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
>>
>> thanks,
>>
>> Glen






Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-11 Thread Glen Newton
I have extended my evaluation (previous evaluation:
http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html)
to include as well as an increasing # of threads performing concurrent
queries, 1,2,4 and 8 IndexReaders.

The results can be found here:
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html

thanks,

Glen




Re: Concurrent query benchmarks

2008-06-10 Thread Glen Newton
Thanks for the positive feedback.  :-)

Yes, right now the benchmark only uses one IndexSearcher for all
threads, but I have completed an extension that allows you to either
1) have multiple searchers for the same index; or 2) have multiple
indexes (copies of one another) with a single searcher per copy (to
test when you have your index copies on separate disks, SANs, NAS,
etc).

I will rerun my benchmarks with increasing numbers of readers & post
the results in the next couple of days.

-glen

2008/6/10 Chris Lu <[EMAIL PROTECTED]>:
> Good work!


> I would like to see how it performs with several index reader instances,
> which is said to increase concurrency.
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
> On Mon, Jun 9, 2008 at 3:51 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
>
>> A number of people have asked about query benchmarks.
>>
>> I have posted benchmarks for concurrent query requests for Lucene
>> 2.3.1 on my blog, where I look at 1 - 4096 concurrent requests:
>>
>> http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
>>
>> I hope you find this useful.
>>
>> thanks,
>>
>> Glen
>>






Re: Concurrent query benchmarks

2008-06-10 Thread Glen Newton
2008/6/9 Otis Gospodnetic <[EMAIL PROTECTED]>:
> Hi Glen,
>
> Thanks for sharing.  Does your benchmarking tool build on top of 
> contrib/benchmark? (not sure if that one lets you specify the number of 
> concurrent threads -- if it does not, perhaps this is an opportunity to add 
> this functionality).

No, it is a stand-alone program. You give it the index directory, the
default query field, the number of threads, and the filename of a file
that contains one Lucene query per line. The output is one line: the
# of threads followed by the # of queries handled per second.

I have a shell script which runs the above with increasing #s of threads.
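
A compressed sketch (invented names and 2.x-era API, not the actual tool) of
the core of such a benchmark: N threads sharing one IndexSearcher, with the
aggregate queries-per-second printed at the end:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;

public class QueryBench {
    // args: indexDir defaultField numThreads queryFile (one query per line)
    public static void main(String[] args) throws Exception {
        final String field = args[1];
        final List<String> queries = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[3]));
        for (String line; (line = in.readLine()) != null;) queries.add(line);
        in.close();

        final IndexSearcher searcher = new IndexSearcher(args[0]); // shared by all threads
        int threads = Integer.parseInt(args[2]);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.currentTimeMillis();
        for (int t = 0; t < threads; t++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        // QueryParser is not thread-safe: one instance per thread
                        QueryParser qp = new QueryParser(field, new StandardAnalyzer());
                        for (String q : queries) searcher.search(qp.parse(q), null, 10);
                    } catch (Exception e) { e.printStackTrace(); }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println(threads + "\t" + (threads * queries.size()) / secs);
    }
}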

> I couldn't find info about the index format (compound or not) you used.  It 
> would be good to see the comparison with high number of threads for the 2 
> index formats.  It would also be good to see the numbers when the index has 
> no deletion and when it has some percentage of docs deleted.

Sorry, I didn't include it. The index in the benchmarks uses the
compound format, with 0% documents deleted.

>
> Finally, if you end up extending contrib/benchmark, I think just having the 
> ability to pump the results of that into a gnuplot script would be nice to 
> have.  I've written a standalone benchmarking tool that did pretty much what 
> yours seems to do, but I wrote it for Technorati, so I can't release it. :(

I would be very willing to contribute what I have, along with the
gnuplot scripts. Let me finish off what I am doing for my work and
I will clean things up a bit and write a little documentation.

-Glen

>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
>> From: Glen Newton <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Tuesday, June 10, 2008 12:51:41 AM
>> Subject: Concurrent query benchmarks
>>
>> A number of people have asked about query benchmarks.
>>
>> I have posted benchmarks for concurrent query requests for Lucene
>> 2.3.1 on my blog, where I look at 1 - 4096 concurrent requests:
>>   http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
>>
>> I hope you find this useful.
>>
>> thanks,
>>
>> Glen
>>






Concurrent query benchmarks

2008-06-09 Thread Glen Newton
A number of people have asked about query benchmarks.

I have posted benchmarks for concurrent query requests for Lucene
2.3.1 on my blog, where I look at 1 - 4096 concurrent requests:
  http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html

I hope you find this useful.

thanks,

Glen




Re: Multi-language support within a single index

2008-06-05 Thread Glen Newton
Yes, thank-you for the pointer, and apologies for not doing my
homework better. :-)
It is exactly what I want.

The scenario is where I have articles which tend to be in English and
have abstracts, and some of them have French-language abstracts.
Users may want to search the English abstracts or the French abstracts
or both.
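
A minimal sketch of that usage (field names abstract_en/abstract_fr are
invented; FrenchAnalyzer lives in contrib, and parse() throws ParseException):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("abstract_fr", new FrenchAnalyzer());  // per-field override

// the same wrapper serves both indexing (pass it to IndexWriter) and querying
QueryParser parser = new QueryParser("abstract_en", wrapper);
Query q = parser.parse("abstract_en:cell abstract_fr:cellule");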

thanks,

Glen

2008/6/5 Erick Erickson <[EMAIL PROTECTED]>:
> I'm not sure what you're getting at, but it seems awfully similar to
> PerFieldAnalyzerWrapper, which already exists and (it seems
> to me on a quick scan) does exactly what you want. And it
> works for both indexing and querying out-of-the-box.
>
> Best
> Erick
>
> On Thu, Jun 5, 2008 at 12:14 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
>
>> I would like to be able to get multi-language support within a single
>> index.
>> I would appreciate input on what I am suggesting:
>>
>> Assuming that you want something like the following in your document:
>> Title_english
>> Title_french
>> Title_german
>> Keyword_english
>> Keyword_french
>> Keyword_german
>>
>> Let's pretend for now that each of these was created with a different
>> appropriate analyzer and the mechanisms for doing this exist (see end
>> of post for more on this).
>>
>> How to handle a query?
>> Could we associate an Analyzer with a set of fields, like this:
>> // pseudo java
>> Analyzer ea = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
>> Analyzer fa = new FrenchAnalyzer({"TitleFrench", "KeywordFrench"});
>> Analyzer ga = new GermanAnalyzer({"TitleGerman", "KeywordGerman"});
>> Analyzer ml = new MultiLanguageAnalyzer();
>> ((MultiLanguageAnalyzer) ml).add(ea);
>> ((MultiLanguageAnalyzer) ml).add(fa);
>> ((MultiLanguageAnalyzer) ml).add(ga);
>> QueryParser parser = new MultiLanguageParser("TitleEnglish", ml);
>> // end
>>
>> Now when
>>  parser.parse("TitleEnglish: foo TitleFrench:bar  smith")
>> is called, MultiLanguageParser uses the appropriate analyzer for each
>> field in the query to parse the sub-query & rolls up all of the
>> queries created by these analyzers into the real query.
>>
>> I am thinking that this would require having separate term
>> dictionaries for each language, thus demanding a significant change in
>> the index format? [Note I am not an expert on Lucene internals]
>>
>> Of course, something similar to the above could be used adding
>> documents to the index.
>>
>> Looking at:
>>  http://lucene.apache.org/java/docs/fileformats.html#Per-Segment%20Files
>> It seems that it would need - instead of the present single set - a
>> set of segment files for each analyzer: .fnm (Fields), tis & tii (term
>> dictionary), .frq (term frequencies), .prx (positions), .nrm
>> (normalizations), .tvx, .tvd, .tvf (term vectors).
>> How stable is the code for this part of the index & would it easily
>> support this kind of extension? Or would some re-factoring be needed
>> to make these sorts of manipulations to the nature of the segments
>> files easier for mere mortal developers?  :-)
>>
>> Is this something that is already being talked about/looked in
>> to/being implemented? :-)
>>
>> thanks,
>>
>> Glen Newton
>> http://zzzoot.blogspot.com/






Multi-language support within a single index

2008-06-05 Thread Glen Newton
I would like to be able to get multi-language support within a single index.
I would appreciate input on what I am suggesting:

Assuming that you want something like the following in your document:
Title_english
Title_french
Title_german
Keyword_english
Keyword_french
Keyword_german

Let's pretend for now that each of these was created with a different
appropriate analyzer and the mechanisms for doing this exist (see end
of post for more on this).

How to handle a query?
Could we associate an Analyzer with a set of fields, like this:
// pseudo java
Analyzer ea = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
Analyzer fa = new FrenchAnalyzer({"TitleFrench", "KeywordFrench"});
Analyzer ga = new GermanAnalyzer({"TitleGerman", "KeywordGerman"});
Analyzer ml = new MultiLanguageAnalyzer();
((MultiLanguageAnalyzer) ml).add(ea);
((MultiLanguageAnalyzer) ml).add(fa);
((MultiLanguageAnalyzer) ml).add(ga);
QueryParser parser = new MultiLanguageParser("TitleEnglish", ml);
// end

Now when
  parser.parse("TitleEnglish: foo TitleFrench:bar  smith")
is called, MultiLanguageParser uses the appropriate analyzer for each
field in the query to parse the sub-query & rolls up all of the
queries created by these analyzers into the real query.

I am thinking that this would require having separate term
dictionaries for each language, thus demanding a significant change in
the index format? [Note I am not an expert on Lucene internals]

Of course, something similar to the above could be used adding
documents to the index.

Looking at:
  http://lucene.apache.org/java/docs/fileformats.html#Per-Segment%20Files
It seems that it would need - instead of the present single set - a
set of segment files for each analyzer: .fnm (Fields), tis & tii (term
dictionary), .frq (term frequencies), .prx (positions), .nrm
(normalizations), .tvx, .tvd, .tvf (term vectors).
How stable is the code for this part of the index & would it easily
support this kind of extension? Or would some re-factoring be needed
to make these sorts of manipulations to the nature of the segments
files easier for mere mortal developers?  :-)

Is this something that is already being talked about/looked in
to/being implemented? :-)

thanks,

Glen Newton
http://zzzoot.blogspot.com/



