[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Peuss updated LUCENE-1166:
-

Attachment: de.xml

A hyphenation grammar. You can download hyphenation grammars from: 
http://downloads.sourceforge.net/offo/offo-hyphenation.zip?modtime=1168687306&big_mirror=0

 A tokenfilter to decompose compound words
 -

 Key: LUCENE-1166
 URL: https://issues.apache.org/jira/browse/LUCENE-1166
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Thomas Peuss
 Attachments: CompoundTokenFilter.patch, de.xml


 A tokenfilter to decompose compound words you find in many Germanic languages 
 (like German, Swedish, ...) into single tokens.
 An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
 that you can find the word even when you only enter Schiff.
 I use the hyphenation code from the Apache XML project FOP 
 (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
 Currently I use the FOP jars directly. I only use a handful of classes from 
 the FOP project.
 My question now:
 Would it be OK to copy these classes over to the Lucene project (renaming the 
 packages, of course), or should I stick with the dependency on the FOP jars? 
 The FOP code uses the ASF V2 license as well.
 What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Peuss updated LUCENE-1166:
-

Attachment: hyphenation.dtd

The DTD describing the hyphenation grammar XML files.

 A tokenfilter to decompose compound words
 -

 Key: LUCENE-1166
 URL: https://issues.apache.org/jira/browse/LUCENE-1166
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Thomas Peuss
 Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd


 A tokenfilter to decompose compound words you find in many Germanic languages 
 (like German, Swedish, ...) into single tokens.
 An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
 that you can find the word even when you only enter Schiff.
 I use the hyphenation code from the Apache XML project FOP 
 (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
 Currently I use the FOP jars directly. I only use a handful of classes from 
 the FOP project.
 My question now:
 Would it be OK to copy these classes over to the Lucene project (renaming the 
 packages, of course), or should I stick with the dependency on the FOP jars? 
 The FOP code uses the ASF V2 license as well.
 What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)
A tokenfilter to decompose compound words
-

 Key: LUCENE-1166
 URL: https://issues.apache.org/jira/browse/LUCENE-1166
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Thomas Peuss
 Attachments: CompoundTokenFilter.patch

A tokenfilter to decompose compound words you find in many Germanic languages 
(like German, Swedish, ...) into single tokens.

An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
that you can find the word even when you only enter Schiff.

I use the hyphenation code from the Apache XML project FOP 
(http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
Currently I use the FOP jars directly. I only use a handful of classes from the 
FOP project.

My question now:
Would it be OK to copy these classes over to the Lucene project (renaming the 
packages, of course), or should I stick with the dependency on the FOP jars? The 
FOP code uses the ASF V2 license as well.

What do you think?
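
A minimal sketch of how such a filter might be wired into an analyzer chain. The
CompoundWordTokenFilter class name and its constructor taking a grammar file are
placeholders for illustration only, not the actual API of the attached patch:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class CompoundAwareAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Hypothetical filter from the attached patch: for each compound it
        // would also emit the parts (Donau, dampf, schiff) as extra tokens,
        // so a query for "schiff" matches "Donaudampfschiff".
        stream = new CompoundWordTokenFilter(stream, "de.xml");
        return stream;
      }
    }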

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Peuss updated LUCENE-1166:
-

Attachment: CompoundTokenFilter.patch

A preliminary version of the token filter.

 A tokenfilter to decompose compound words
 -

 Key: LUCENE-1166
 URL: https://issues.apache.org/jira/browse/LUCENE-1166
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Thomas Peuss
 Attachments: CompoundTokenFilter.patch


 A tokenfilter to decompose compound words you find in many Germanic languages 
 (like German, Swedish, ...) into single tokens.
 An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
 that you can find the word even when you only enter Schiff.
 I use the hyphenation code from the Apache XML project FOP 
 (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
 Currently I use the FOP jars directly. I only use a handful of classes from 
 the FOP project.
 My question now:
 Would it be OK to copy these classes over to the Lucene project (renaming the 
 packages, of course), or should I stick with the dependency on the FOP jars? 
 The FOP code uses the ASF V2 license as well.
 What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by DoronCohen

2008-02-06 Thread Grant Ingersoll

Hey Doron,

I see you recommend that we think about making SweetSpot the default
similarity.  Do you have numbers for running that alone?  Or,
for that matter, for any of the other combinations in #3 individually?


Thanks,
Grant

On Jan 31, 2008, at 4:09 AM, Doron Cohen wrote:


Hi Otis,

On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic 
[EMAIL PROTECTED] wrote:


Doron - this looks super useful!
Can you give an example for the lexical affinities you mention here?
(Juru creates posting lists for lexical affinities)



Sure - simply put, denote {X} as the posting list of term X. Then for a
query - A B C D - in addition to the four posting lists {A}, {B}, {C}, {D},
which are processed ignoring position info (i.e. Lucene's termDocs()), Juru
also computes combined posting lists {A,B}, {A,C}, {A,D}, {B,C}, {B,D} and
{C,D}, in which a (virtual) term {X,Y} is said to exist in a document D if
the two words X and Y are found in that document within a sliding window of
size L (say 5).

(You can also require LA's in order, which is useful in some scenarios.)


Juru's tokenization detects sentences and so the two words must be in the
same sentence. The term-freq of that LA-term in the doc is as usual the
number of matches in that doc satisfying this sliding window rule.

The IDF of this term is not known in advance, and so it is first estimated
based on the DF of X and Y, and this estimate is later tuned as more
documents are processed and more statistics are available.

You can see the resemblance to SpanNear queries. Note that the IDF of this
virtual term is going to be high, and as such it is focusing the query
search on the more relevant documents.

In my Lucene implementation for this I used a window size of 7, and note
that (1) there was no sentence-boundary knowledge in my Lucene
implementation and (2) the IDF was fixed all along, estimated by the
involved terms' IDF as computed once in the SpanNear query. The default
computation is their sum. This is in most cases too low an IDF, I think.
Phrase query BTW behaves the same.
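
As a rough sketch of that Lucene-side approximation (the field and term names
are made up; only the SpanNearQuery construction is the point), an unordered
span query with a slop of 7 plays the role of the LA {X,Y}:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class LexicalAffinitySketch {
      // Builds an unordered SpanNearQuery approximating the LA {x, y}:
      // both terms within a window of ~7 positions, in any order.
      public static SpanQuery lexicalAffinity(String field, String x, String y) {
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term(field, x)),
            new SpanTermQuery(new Term(field, y))
        };
        return new SpanNearQuery(clauses, 7, false);
      }
    }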

So in both cases (Phrase, Span) I think it would be interesting to
experiment with adaptive IDF computation that updates the IDF as more
documents are processed. When the query is made of only a single span or
only a single phrase element this is a waste of time. But when the query is
more complex (as the query we built) and you have in the query both
multi-term parts and single-term parts, or several multi-term parts, then a
more accurate IDF can improve the quality, I would think. Implementation-wise
the Weight.value would need to be updated, and that might raise questions about
the normalizing of other query parts, but I am not sure about this now.

Well, I hope this makes sense - I will update the Wiki page with similar
info...

Also:


Normalized term-frequency, as in Juru.
Here, tf(freq) is normalized by the average term frequency of the
document.

I've never seen this mentioned anywhere except here and once here  
on the

ML (was it you who mentioned this?), but this sounds intuitive.



Yes, I think I mentioned this - I think it is not our idea - Juru uses it, but
it was used before in the SMART system - see Length Normalization in
Degraded Text Collections (1995) - http://citeseer.ist.psu.edu/100699.html,
and New Retrieval Approaches Using SMART: TREC 4 -
http://citeseer.ist.psu.edu/144841.html.
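
A minimal sketch of the normalization being described, assuming the
per-document average term frequency has been computed elsewhere (this is not
Juru's or Lucene's actual code, and keeping the usual sqrt damping after
normalizing is just one possible choice):

    public class NormalizedTfSketch {
      // freq:  raw frequency of the term in the document
      // avgTf: average term frequency over all terms of that document (> 0)
      public static float normalizedTf(int freq, float avgTf) {
        return (float) Math.sqrt(freq / avgTf);
      }
    }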



What do others think?
Otis



--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566163#action_12566163
 ] 

Steven Rowe commented on LUCENE-1157:
-

If I browse to 
[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/changes/] , or anything 
in that directory, including Changes.html, I see a Hudson page dedicated to 
per-nightly-build commits.  Nice page :).  I'm guessing what's going on is a 
namespace issue: on the hudson server, anything you put into 
{{Lucene-trunk/changes/}} is unlinkable-to, because that directory is dedicated 
to the Hudson Changes page.

Fixing this may be as simple as changing the name of the target directory, 
maybe to {{official-changes/}} or something like that.

 Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
 read by Lucene users)
 -

 Key: LUCENE-1157
 URL: https://issues.apache.org/jira/browse/LUCENE-1157
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doron Cohen
Assignee: Doron Cohen
 Fix For: 2.4

 Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
 lucene-1157.patch


 Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566167#action_12566167
 ] 

Doron Cohen commented on LUCENE-1157:
-

I suspected something like this but wasn't sure...
Ok I'll rename the directory and then we'll see.


 Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
 read by Lucene users)
 -

 Key: LUCENE-1157
 URL: https://issues.apache.org/jira/browse/LUCENE-1157
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doron Cohen
Assignee: Doron Cohen
 Fix For: 2.4

 Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
 lucene-1157.patch


 Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-997:
---

Attachment: timeout.patch

Attached patch corrects default resolution comment.

 Add search timeout support to Lucene
 

 Key: LUCENE-997
 URL: https://issues.apache.org/jira/browse/LUCENE-997
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Sean Timm
Priority: Minor
 Attachments: HitCollectorTimeoutDecorator.java, 
 LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
 timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
 timeout.patch, timeout.patch, TimerThreadTest.java


 This patch is based on Nutch-308. 
 This patch adds support for a maximum search time limit. After this time is 
 exceeded, the search thread is stopped, partial results (if any) are returned 
 and the total number of results is estimated.
 This patch tries to minimize the overhead related to time-keeping by using a 
 version of a safe, unsynchronized timer.
 This was also discussed in an e-mail thread.
 http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-997:
---

Attachment: timeout.patch

 Add search timeout support to Lucene
 

 Key: LUCENE-997
 URL: https://issues.apache.org/jira/browse/LUCENE-997
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Sean Timm
Priority: Minor
 Attachments: HitCollectorTimeoutDecorator.java, 
 LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
 timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
 timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java


 This patch is based on Nutch-308. 
 This patch adds support for a maximum search time limit. After this time is 
 exceeded, the search thread is stopped, partial results (if any) are returned 
 and the total number of results is estimated.
 This patch tries to minimize the overhead related to time-keeping by using a 
 version of a safe, unsynchronized timer.
 This was also discussed in an e-mail thread.
 http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566175#action_12566175
 ] 

Doron Cohen commented on LUCENE-997:


Oh, I wrote that comment before I decided to change the default... 
Thanks for catching this.

 Add search timeout support to Lucene
 

 Key: LUCENE-997
 URL: https://issues.apache.org/jira/browse/LUCENE-997
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Sean Timm
Priority: Minor
 Attachments: HitCollectorTimeoutDecorator.java, 
 LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
 timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
 timeout.patch, TimerThreadTest.java


 This patch is based on Nutch-308. 
 This patch adds support for a maximum search time limit. After this time is 
 exceeded, the search thread is stopped, partial results (if any) are returned 
 and the total number of results is estimated.
 This patch tries to minimize the overhead related to time-keeping by using a 
 version of a safe, unsynchronized timer.
 This was also discussed in an e-mail thread.
 http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-06 Thread Sean Timm (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566171#action_12566171
 ] 

Sean Timm commented on LUCENE-997:
--

Doron, your comment for setResolution(long) says "The default timer resolution 
is 50 milliseconds"; however, the default is actually 20 ms (public static 
final int DEFAULT_RESOLUTION = 20;).  Other than that, everything looks great.
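
For readers following along, a minimal sketch of the decorator idea behind the
patch (class and message names here are illustrative only, not the patch's
actual API; the patch itself uses an unsynchronized timer with the configurable
resolution discussed above rather than checking the clock on every hit):

    import org.apache.lucene.search.HitCollector;

    public class TimeoutHitCollectorSketch extends HitCollector {
      private final HitCollector delegate;
      private final long deadline;   // absolute time in milliseconds

      public TimeoutHitCollectorSketch(HitCollector delegate, long timeAllowedMs) {
        this.delegate = delegate;
        this.deadline = System.currentTimeMillis() + timeAllowedMs;
      }

      public void collect(int doc, float score) {
        if (System.currentTimeMillis() > deadline) {
          // Signal the searcher to stop; the caller can catch this and
          // return the partial results collected so far.
          throw new RuntimeException("search time limit exceeded");
        }
        delegate.collect(doc, score);
      }
    }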

 Add search timeout support to Lucene
 

 Key: LUCENE-997
 URL: https://issues.apache.org/jira/browse/LUCENE-997
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Sean Timm
Priority: Minor
 Attachments: HitCollectorTimeoutDecorator.java, 
 LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
 timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
 timeout.patch, TimerThreadTest.java


 This patch is based on Nutch-308. 
 This patch adds support for a maximum search time limit. After this time is 
 exceeded, the search thread is stopped, partial results (if any) are returned 
 and the total number of results is estimated.
 This patch tries to minimize the overhead related to time-keeping by using a 
 version of a safe, unsynchronized timer.
 This was also discussed in an e-mail thread.
 http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566188#action_12566188
 ] 

Steven Rowe commented on LUCENE-1166:
-

Hi Thomas,

Looking at [http://offo.sourceforge.net/hyphenation/licenses.html], which seems 
to be the same information as in the offo-hyphenation.zip file you attached to 
this issue, the license issue may be a problem - the hyphenation data is 
covered by different licenses on a per-language basis.  For example, there are 
two German data files, and both are licensed under a LaTeX license, as is the 
Danish file, and these two languages are the most likely targets for your 
TokenFilter.  IANAL, but unless Apache licenses can be secured for this data, I 
don't think the files can be incorporated directly into an Apache project.

Also, I don't see Swedish among the hyphenation data licenses - is it covered 
in some other way?

 A tokenfilter to decompose compound words
 -

 Key: LUCENE-1166
 URL: https://issues.apache.org/jira/browse/LUCENE-1166
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Thomas Peuss
 Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd


 A tokenfilter to decompose compound words you find in many Germanic languages 
 (like German, Swedish, ...) into single tokens.
 An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
 that you can find the word even when you only enter Schiff.
 I use the hyphenation code from the Apache XML project FOP 
 (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
 Currently I use the FOP jars directly. I only use a handful of classes from 
 the FOP project.
 My question now:
 Would it be OK to copy these classes over to the Lucene project (renaming the 
 packages, of course), or should I stick with the dependency on the FOP jars? 
 The FOP code uses the ASF V2 license as well.
 What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Nigel Daley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566206#action_12566206
 ] 

Nigel Daley commented on LUCENE-1157:
-

I suggest you save the Changes.html as one of the build artifacts (just like 
the tar.gz files are saved).  Grant can add this file to the artifacts list in 
the Hudson configuration screen if you want this done.

 Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
 read by Lucene users)
 -

 Key: LUCENE-1157
 URL: https://issues.apache.org/jira/browse/LUCENE-1157
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doron Cohen
Assignee: Doron Cohen
 Fix For: 2.4

 Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
 lucene-1157.patch


 Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566220#action_12566220
 ] 

Thomas Peuss commented on LUCENE-1166:
--

bq. Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which 
seems to be the same information as in the offo-hyphenation.zip file you 
attached to this issue, the license issue may be a problem - the hyphenation 
data is covered by different licenses on a per-language basis. For example, 
there are two German data files, and both are licensed under a LaTeX license, 
as is the Danish file, and these two languages are the most likely targets for 
your TokenFilter. IANAL, but unless Apache licenses can be secured for this 
data, I don't think the files can be incorporated directly into an Apache 
project.

This is true. And that's why I uploaded the two files without the ASF license 
grant. The FOP project does not have the files in its code base either, because 
of the licensing problem.

bq. Also, I don't see Swedish among the hyphenation data licenses - is it 
covered in some other way?
OFFO has no Swedish grammar file. We can generate a Swedish grammar file out of 
the LaTeX grammar files. I will have a look into this tonight.

All other hyphenation implementations I have found so far use them either 
directly or in a converted variant like the FOP code. What we can do, of course, 
is ask the authors of the LaTeX files if they want to license their work 
under the ASF license as well. It is worth a try. But I suppose that many email 
addresses in the LaTeX files are not used anymore. I will try to contact the 
authors of the German grammar files tomorrow.

BTW: an example for those that don't want to try the patch:
+Input token stream:+
Rindfleischüberwachungsgesetz Drahtschere abba

+Output token stream:+
(Rindfleischüberwachungsgesetz,0,29)
(Rind,0,4,posIncr=0)
(fleisch,4,11,posIncr=0)
(überwachung,11,22,posIncr=0)
(gesetz,23,29,posIncr=0)
(Drahtschere,30,41)
(Draht,30,35,posIncr=0)
(schere,35,41,posIncr=0)
(abba,42,46)
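
For those wondering how the posIncr=0 tokens above come about, here is a minimal
sketch of the general stacking technique against the Lucene 2.x TokenFilter API.
The decompose() helper is a placeholder (the real patch derives the parts from
the hyphenation grammar), and the offset bookkeeping assumes the parts tile the
compound exactly:

    import java.io.IOException;
    import java.util.LinkedList;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class DecompositionFilterSketch extends TokenFilter {
      private final LinkedList parts = new LinkedList();  // buffered sub-tokens

      public DecompositionFilterSketch(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        if (!parts.isEmpty()) {
          return (Token) parts.removeFirst();    // emit buffered parts first
        }
        Token token = input.next();
        if (token == null) {
          return null;
        }
        String[] subWords = decompose(token.termText());
        int start = token.startOffset();
        for (int i = 0; i < subWords.length; i++) {
          Token part = new Token(subWords[i], start, start + subWords[i].length());
          part.setPositionIncrement(0);          // stack the part on the compound
          parts.add(part);
          start += subWords[i].length();
        }
        return token;                            // the original compound comes first
      }

      private String[] decompose(String word) {
        // Placeholder: the real filter consults the hyphenation grammar here.
        return new String[0];
      }
    }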

 A tokenfilter to decompose compound words
 -

 Key: LUCENE-1166
 URL: https://issues.apache.org/jira/browse/LUCENE-1166
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Thomas Peuss
 Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd


 A tokenfilter to decompose compound words you find in many Germanic languages 
 (like German, Swedish, ...) into single tokens.
 An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
 that you can find the word even when you only enter Schiff.
 I use the hyphenation code from the Apache XML project FOP 
 (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
 Currently I use the FOP jars directly. I only use a handful of classes from 
 the FOP project.
 My question now:
 Would it be OK to copy these classes over to the Lucene project (renaming the 
 packages, of course), or should I stick with the dependency on the FOP jars? 
 The FOP code uses the ASF V2 license as well.
 What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
There have been several proposals for a Lucene-based distributed index
architecture.
 1) Doug Cutting's Index Server Project Proposal at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
 2) Solr's Distributed Search at
http://wiki.apache.org/solr/DistributedSearch
 3) Mark Butler's Distributed Lucene at
http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture.
Our design differs from the above proposals in the way it leverages Hadoop
as much as possible. In particular, HDFS is used to reliably store Lucene
instances, Map/Reduce is used to analyze documents and update Lucene instances
in parallel, and Hadoop's IPC framework is used. Our design is geared for
applications that require a highly scalable index and where batch updates
to each Lucene instance are acceptable (versus finer-grained
document-at-a-time updates).

We have a working implementation of our design and are in the process
of evaluating its performance. An overview of our design is provided below.
We welcome feedback and would like to know if you are interested in working
on it. If so, we would be happy to make the code publicly available. At the
same time, we would like to collaborate with people working on existing
proposals and see if we can consolidate our efforts.

TERMINOLOGY
A distributed index is partitioned into shards. Each shard corresponds to
a Lucene instance and contains a disjoint subset of the documents in the index.
Each shard is stored in HDFS and served by one or more shard servers. Here
we only talk about a single distributed index, but in practice multiple indexes
can be supported.

A master keeps track of the shard servers and the shards being served by
them. An application updates and queries the global index through an
index client. An index client communicates with the shard servers to
execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the
discussion, some of their parameters have been omitted.

  On the Shard Servers
// Execute a query on this shard server's Lucene instance.
// This method is called by an index client.
SearchResults search(Query query);

  On the Master
// Tell the master to update the shards, i.e., Lucene instances.
// This method is called by an index client.
boolean updateShards(Configuration conf);

// Ask the master where the shards are located.
// This method is called by an index client.
LocatedShards getShardLocations();

// Send a heartbeat to the master. This method is called by a
// shard server. In the response, the master informs the
// shard server when to switch to a newer version of the index.
ShardServerCommand sendHeartbeat();

QUERYING THE INDEX
To query the index, an application sends a search request to an index client.
The index client then calls the shard server search() method for each shard
of the index, merges the results and returns them to the application. The
index client caches the mapping between shards and shard servers by
periodically calling the master's getShardLocations() method.
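
A minimal sketch of that fan-out and merge step (the ShardServer interface below
is only a stand-in for the RPC proxy; a real client would call the servers in
parallel, re-rank the merged hits, and refresh the shard-to-server map from the
master):

    import java.util.ArrayList;
    import java.util.List;

    public class IndexClientSketch {
      /** Placeholder for the per-shard search RPC described above. */
      public interface ShardServer {
        List search(String query);   // returns this shard's hits
      }

      // Fan the query out to every shard server and concatenate the hits.
      public static List search(String query, ShardServer[] servers) {
        List merged = new ArrayList();
        for (int i = 0; i < servers.length; i++) {
          merged.addAll(servers[i].search(query));
        }
        return merged;
      }
    }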

UPDATING THE INDEX USING MAP/REDUCE
To update the index, an application sends an update request to an index client.
The index client then calls the master's updateShards() method, which schedules
a Map/Reduce job to update the index. The Map/Reduce job updates the shards in
parallel and copies the new index files of each shard (i.e., Lucene instance)
to HDFS.

The updateShards() method includes a configuration, which provides
information for updating the shards. More specifically, the configuration
includes the following information:
  - Input path. This provides the location of updated documents, e.g., HDFS
    files or directories, or HBase tables.
  - Input formatter. This specifies how to format the input documents.
  - Analysis. This defines the analyzer to use on the input. The analyzer
    determines whether a document is being inserted, updated, or deleted. For
    inserts or updates, the analyzer also converts each input document into
    a Lucene document.

The Map phase of the Map/Reduce job formats and analyzes the input (in
parallel), while the Reduce phase collects and applies the updates to each
Lucene instance (again in parallel). The updates are applied using the local
file system where a Reduce task runs and then copied back to HDFS. For example,
if the updates caused a new Lucene segment to be created, the new segment
would be created on the local file system first, and then copied back to HDFS.

When the Map/Reduce job completes, a new version of the index is ready to be
queried. It is important to note that the new version of the index is not
derived from scratch. By leveraging Lucene's update algorithm, the new version
of each Lucene instance will share as many files as possible with the previous
version.

ENSURING INDEX CONSISTENCY
At any point in time, an index client always has 

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I assume that Google also has distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this?

J.D.



On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote:

 There seem to be a few other players in this space too.

 Are you from Rackspace?
 (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

 AOL also has a Hadoop/Solr project going on.

 CNET does not have much brewing there.  Although Yonik and I had
 talked about it a bunch -- but that was long ago.

 --cw

 Clay Webster   tel:1.908.541.3724
 Associate VP, Platform Infrastructure http://www.cnet.com
 CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]


  -Original Message-
  From: Ning Li [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 06, 2008 1:57 PM
  To: [EMAIL PROTECTED]; java-dev@lucene.apache.org; solr-
  [EMAIL PROTECTED]
  Subject: Lucene-based Distributed Index Leveraging Hadoop
 
  There have been several proposals for a Lucene-based distributed index
  architecture.
   1) Doug Cutting's Index Server Project Proposal at
 
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
   2) Solr's Distributed Search at
  http://wiki.apache.org/solr/DistributedSearch
   3) Mark Butler's Distributed Lucene at
  http://wiki.apache.org/hadoop/DistributedLucene
 
  We have also been working on a Lucene-based distributed index
  architecture.
  Our design differs from the above proposals in the way it leverages
  Hadoop
  as much as possible. In particular, HDFS is used to reliably store
  Lucene
  instances, Map/Reduce is used to analyze documents and update Lucene
  instances
  in parallel, and Hadoop's IPC framework is used. Our design is geared
  for
  applications that require a highly scalable index and where batch
  updates
  to each Lucene instance are acceptable (verses finer-grained document
  at
  a time updates).
 
  We have a working implementation of our design and are in the process
  of evaluating its performance. An overview of our design is provided
  below.
  We welcome feedback and would like to know if you are interested in
  working
  on it. If so, we would be happy to make the code publicly available.
 At
  the
  same time, we would like to collaborate with people working on
 existing
  proposals and see if we can consolidate our efforts.
 
  TERMINOLOGY
  A distributed index is partitioned into shards. Each shard
  corresponds
  to
  a Lucene instance and contains a disjoint subset of the documents in
  the
  index.
  Each shard is stored in HDFS and served by one or more shard
 servers.
  Here
  we only talk about a single distributed index, but in practice
 multiple
  indexes
  can be supported.
 
  A master keeps track of the shard servers and the shards being
 served
  by
  them. An application updates and queries the global index through an
  index client. An index client communicates with the shard servers to
  execute a query.
 
  KEY RPC METHODS
  This section lists the key RPC methods in our design. To simplify the
  discussion, some of their parameters have been omitted.
 
On the Shard Servers
  // Execute a query on this shard server's Lucene instance.
  // This method is called by an index client.
  SearchResults search(Query query);
 
On the Master
  // Tell the master to update the shards, i.e., Lucene instances.
  // This method is called by an index client.
  boolean updateShards(Configuration conf);
 
  // Ask the master where the shards are located.
  // This method is called by an index client.
  LocatedShards getShardLocations();
 
  // Send a heartbeat to the master. This method is called by a
  // shard server. In the response, the master informs the
  // shard server when to switch to a newer version of the index.
  ShardServerCommand sendHeartbeat();
 
  QUERYING THE INDEX
  To query the index, an application sends a search request to an index
  client.
  The index client then calls the shard server search() method for each
  shard
  of the index, merges the results and returns them to the application.
  The
  index client caches the mapping between shards and shard servers by
  periodically calling the master's getShardLocations() method.
 
  UPDATING THE INDEX USING MAP/REDUCE
  To update the index, an application sends an update request to an
 index
  client.
  The index client then calls the master's updateShards() method, which
  schedules
  a Map/Reduce job to update the index. The Map/Reduce job updates the
  shards
  in
  parallel and copies the new index files of each shard (i.e., Lucene
  instance)
  to HDFS.
 
  The updateShards() method includes a configuration, which provides
  information for updating the shards. More specifically, the
  configuration
  includes the following information:
- Input path. This provides the location of updated documents, e.g.,
  HDFS

Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by DoronCohen

2008-02-06 Thread Doron Cohen
Hi Grant, yes I have these combinations - I just updated the wiki page with
these numbers.

I still have the index as described, allowing me to try other ideas that may
come up, or to run more tests (on GOV2 data) if we need them to make better
decisions...

Cheers, Doron

On Wed, Feb 6, 2008 at 2:15 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:

 Hey Doron,

  I see you recommend that we think about making SweetSpot the default
  similarity.  Do you have numbers for running that alone?  Or,
  for that matter, for any of the other combinations in #3 individually?

 Thanks,
 Grant

 On Jan 31, 2008, at 4:09 AM, Doron Cohen wrote:

  Hi Otis,
 
  On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic 
  [EMAIL PROTECTED] wrote:
 
  Doron - this looks super useful!
  Can you give an example for the lexical affinities you mention here?
  (Juru creates posting lists for lexical affinities)
 
 
  Sure, - simply put, denote {X} as the posting list of term X, then
  for a
  query - A B C D - in addition to the four posting lists {A}, {B},
  {C}, {D}
  which are processed ignoring position info (i.e. Lucene's
  termDocs()) Juru
  also computes combined posting lists {A,B}, {A,C}, {A,D}, {B,C},
  {B,D} and
  {C,D} in which a (virtual) term {X,Y} is said to exist in a document
  D if
  the two words X and Y are found in that document within a sliding
  window of
  size L (say 5).
 
  (You can also require LA's in order which is useful in some
  scenarios.)
 
  Juru's tokenization detects sentences and so the two words must be
  in the
  same sentence. The term-freq of that LA-term in the doc is as usual
  the
  number of matches in that doc satisfying this sliding window rule.
 
  The IDF of this term is not known in advance, and so it is first
  estimated
  based on the DF of X and Y, and this estimate is later tuned as more
  documents are processed and more statistics are available.
 
  You can see the resemblance to SpanNear queries. Note that the IDF
  of this
  virtual term is going to be high and as such it is focusing the
  query
  search on the more relevant documents.
 
  In my Lucene implementation for this I used a window size of 7, and
  note
  that (1) there was no sentence boundaries knowledge in my Lucene
  implementation and (2) the IDF was fixed all along, estimated by the
  involved terms IDF, as computed once in SpanNear query. The default
  computation is their sum. This is in most cases too low an IDF, I
  think.
  Phrase query btw behaves the same.
 
  So in both cases (Phrase, Span) I think it would be interesting to
  experiment with adaptive IDF computation that updates the IDF as more
  documents are processed. When the query is made of only a single
  span or
  only a single phrase element this is a waste of time. But when the
  query is
  more complex (as the query we built) and you have in the query both
  multi-term parts and single-term parts, or several multi-term parts,
  then a
  more accurate IDF can improve the quality I would think.
  Implementation wise
  the Weight.value would need to be updated and might raise
  questions about
  the normalizing of other query parts, but I am not sure about this
  now.
 
  Well I hope this makes sense - I will update the Wiki page with
  similar
  info...
 
  Also:
 
  Normalized term-frequency, as in Juru.
  Here, tf(freq) is normalized by the average term frequency of the
  document.
 
  I've never seen this mentioned anywhere except here and once here
  on the
  ML (was it you who mentioned this?), but this sounds intuitive.
 
 
  Yes I think I mentioned this - I think it is not our idea - Juru
  uses it but
  it was used before in the SMART system - see Length Normalization in
  Degraded Text Collections (1995) -
 http://citeseer.ist.psu.edu/100699.html
  ,
  and New Retrieval Approaches Using SMART : TREC 4 -
  http://citeseer.ist.psu.edu/144841.html.
 
 
  What do others think?
  Otis
 

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com
 http://www.lucenebootcamp.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ





 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by DoronCohen

2008-02-06 Thread Doron Cohen
On Thu, Jan 31, 2008 at 11:09 AM, Doron Cohen [EMAIL PROTECTED] wrote:

 Hi Otis,

 On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic 
 [EMAIL PROTECTED] wrote:

  Doron - this looks super useful!
  Can you give an example for the lexical affinities you mention here?
  (Juru creates posting lists for lexical affinities)


 Sure, - simply put, denote {X} as the posting list of term X, then for a
 query - A B C D - in addition to the four posting lists {A}, {B}, {C}, {D}
 which are processed ignoring position info (i.e. Lucene's termDocs()) Juru
 also computes combined posting lists {A,B}, {A,C}, {A,D}, {B,C}, {B,D} and
 {C,D} in which a (virtual) term {X,Y} is said to exist in a document D if
 the two words X and Y are found in that document within a sliding window of
 size L (say 5).


The wiki page now has a more complete example.

(You can also require LA's in order which is useful in some scenarios.)

 Juru's tokenization detects sentences and so the two words must be in the
 same sentence. The term-freq of that LA-term in the doc is as usual the
 number of matches in that doc satisfying this sliding window rule.

 The IDF of this term is not known in advance, and so it is first estimated
 based on the DF of X and Y, and this estimate is later tuned as more
 documents are processed and more statistics are available.


This was not so accurate a description. What Juru really does is compute in
advance the first e.g. 1MB of the LA posting and use its computed IDF for
the entire posting. Experiments with more accurate adaptive computation (for
longer LA postings) showed no advantage over this simpler approach.


 You can see the resemblance to SpanNear queries. Note that the IDF of this
 virtual term is going to be high and as such it is focusing the query
 search on the more relevant documents.

 In my Lucene implementation for this I used a window size of 7, and note
 that (1) there was no sentence boundaries knowledge in my Lucene
 implementation and (2) the IDF was fixed all along, estimated by the
 involved terms IDF, as computed once in SpanNear query. The default
 computation is their sum. This is in most cases too low an IDF, I think.
 Phrase query btw behaves the same.

 So in both cases (Phrase, Span) I think it would be interesting to
 experiment with adaptive IDF computation that updates the IDF as more
 documents are processed. When the query is made of only a single span or
 only a single phrase element this is a waste of time. But when the query is
 more complex (as the query we built) and you have in the query both
 multi-term parts and single-term parts, or several multi-term parts, then a
 more accurate IDF can improve the quality I would think. Implementation wise
 the Weight.value would need to be updated and might raise questions
 about the normalizing of other query parts, but I am not sure about this
 now.


Well, I discussed this with my colleague David Carmel, who pointed out
that summing the IDFs actually makes sense: each IDF is *nearly* the
log of nDocs/DF, so summing the near-logs is (nearly) the log of the
multiplication (of (1+nDocs/DF)). So I no longer see here a problem
to fix, or an immediate opportunity to explore...
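
Roughly, ignoring Lucene's exact smoothing constants, the point is:

  idf(X) + idf(Y) ~= log(N/df_X) + log(N/df_Y) = log(N^2 / (df_X * df_Y))

so the summed IDF already behaves like the log of a very small combined
document frequency, which is why it is a reasonable stand-in for the virtual
term's IDF.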




 Well I hope this makes sense - I will update the Wiki page with similar
 info...

 Also:
 
  Normalized term-frequency, as in Juru.
  Here, tf(freq) is normalized by the average term frequency of the
  document.
 
  I've never seen this mentioned anywhere except here and once here on the
  ML (was it you who mentioned this?), but this sounds intuitive.


 Yes I think I mentioned this - I think it is not our idea - Juru uses it
 but it was used before in the SMART system - see Length Normalization in
 Degraded Text Collections (1995) -
 http://citeseer.ist.psu.edu/100699.html, and New Retrieval Approaches
 Using SMART : TREC 4 - http://citeseer.ist.psu.edu/144841.html.


  What do others think?
  Otis
 




Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by PaulElschot

2008-02-06 Thread Paul Elschot
Oh well, I ticked the remove trailing white space box.
The only real addition is at the end:

* Easier and more efficient ways to add proximity scoring?
 +For example specialize Span-Near-Query for the case when all subqueries 
 are terms.

Regards,
Paul Elschot



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1167) add compatibility statement to README.txt for all contribs

2008-02-06 Thread Hoss Man (JIRA)
add compatibility statement to README.txt for all contribs
--

 Key: LUCENE-1167
 URL: https://issues.apache.org/jira/browse/LUCENE-1167
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Reporter: Hoss Man
 Fix For: 2.9


as discussed on the mailing list, not all contribs are created equal, and we 
should include comments about the backwards compatibility of each contrib 
in the README.txt before the next release

http://www.nabble.com/Back-Compatibility-to14918202.html#a14918202

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: detected corrupted index / performance improvement

2008-02-06 Thread Michael McCandless


robert engels wrote:

Do we have any way of determining if a segment is definitely OK/ 
VALID ?


The only way I know is the CheckIndex tool, and it's rather slow (and
it's not clear that it always catches all corruption).


If so, a much more efficient transactional system could be developed.

Serialize the updates to a log file. Sync the log. Update the
Lucene index WITHOUT any sync.  Log file writing/sync is VERY
efficient since it is sequential, and a single file.

Upon open of the index, detect if the index was not shut down cleanly.
If so, determine the last valid segment, delete the bad segments,
and then perform the updates (from the log file) since the last
valid segment was written.

The detection could be a VERY slow operation, but this is OK, since
it should be rare, and then you will only pay this price on the
rare occasion, not on every update.
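
A minimal sketch of the logging half of this proposal (record format, replay on
recovery, and log pruning are omitted; the names are illustrative only):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class TransactionLogSketch {
      private final RandomAccessFile log;

      public TransactionLogSketch(File file) throws IOException {
        this.log = new RandomAccessFile(file, "rw");
        log.seek(log.length());                  // append to the existing log
      }

      // Append one serialized update and force it to disk; the matching
      // Lucene index write can then proceed without any sync of its own.
      public void append(byte[] serializedUpdate) throws IOException {
        log.writeInt(serializedUpdate.length);
        log.write(serializedUpdate);
        log.getFD().sync();                      // cheap: sequential, single file
      }
    }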


Wouldn't you still need to sync periodically, so you can prune the
transaction log?  Else your transaction log is growing as fast as the
index?  (You've doubled disk usage).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: detected corrupted index / performance improvement

2008-02-06 Thread DM Smith


On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote:



robert engels wrote:

Do we have any way of determining if a segment is definitely OK/ 
VALID ?


The only way I know is the CheckIndex tool, and it's rather slow (and
it's not clear that it always catches all corruption).


Just a thought. It seems that the discussion has revolved around
whether a crash or similar event has left the file in an inconsistent
state. Without looking into how it is actually done, I'm going to
guess that the writing is done from the start of the file to its end.
That is, no out-of-order writing.

If this is the case, how about adding a marker to the end of the file
of a known size and pattern? If it is present then it is presumed that
there were no errors in getting to that point.

Even with out-of-order writing, one could write an 'INVALID' marker at
the beginning of the operation and then, upon reaching the end of the
writing, replace it with the valid marker.

If neither marker is found then the index is one from before the
capability was added and nothing can be said about the validity.
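
A minimal sketch of the INVALID/VALID marker idea, with the marker kept at a
fixed offset (purely illustrative; as the follow-ups later in this thread point
out, without an fsync and honest hardware the OS may reorder writes, so the
marker can reach disk before the data it is supposed to vouch for):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ValidityMarkerSketch {
      private static final byte INVALID = 0;
      private static final byte VALID = 1;

      // Write INVALID before starting to write the file's real content,
      // and flip it to VALID only once everything else has been written.
      public static void markInvalid(File f) throws IOException { write(f, INVALID); }
      public static void markValid(File f) throws IOException { write(f, VALID); }

      private static void write(File f, byte marker) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        try {
          raf.seek(0);                 // marker lives at a fixed, known offset
          raf.writeByte(marker);
        } finally {
          raf.close();
        }
      }
    }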


-- DM

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
Yes, but this pruning could be more efficient. On a background
thread, get the current segment from the segments file, call the system-wide
sync (e.g. System.exec(fsync)), and then you can purge the transaction
logs for all segments up to that one. Since it is a background
operation, you are not blocking the writing of new segments and tx logs.
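
A minimal sketch of that background step (TxLog and its pruneUpTo() method are
hypothetical, named only for illustration; the exec'd command stands in for
whatever system-wide flush is available on the platform):

    import java.io.IOException;

    public class LogPrunerSketch {
      /** Hypothetical transaction-log handle. */
      public interface TxLog {
        void pruneUpTo(String segmentsFile);
      }

      // Snapshot the current segments file name first, force OS buffers to
      // disk, then drop log entries already covered by the synced index.
      public static void pruneInBackground(final TxLog log, final String currentSegments) {
        Thread pruner = new Thread(new Runnable() {
          public void run() {
            try {
              Runtime.getRuntime().exec("sync").waitFor();  // system-wide flush
              log.pruneUpTo(currentSegments);               // now safe to drop
            } catch (IOException e) {
              // a failed prune is harmless; the log just stays longer
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        });
        pruner.setDaemon(true);
        pruner.start();
      }
    }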


On Feb 6, 2008, at 4:42 PM, Michael McCandless wrote:



robert engels wrote:

Do we have any way of determining if a segment is definitely OK/ 
VALID ?


The only way I know is the CheckIndex tool, and it's rather slow (and
it's not clear that it always catches all corruption).


If so, a much more efficient transactional system could be developed.

Serialize the updates to a log file. Sync the log. Update the  
lucene index WITHOUT any sync.  Log file writing/sync is VERY  
efficient since it is sequential, and a single file.


Upon open of the index, detect if index was not shutdown cleanly.  
If so, determine the last valid segment, delete the bad segments,  
and then perform the updates (from the log file) since the last  
valid segment was written.


The detection could be a VERY slow operation, but this is ok,  
since it should be rare, and then you will only pay this price on  
the rare occasion, not on every update.


Wouldn't you still need to sync periodically, so you can prune the
transaction log?  Else your transaction log is growing as fast as the
index?  (You've doubled disk usage).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: detected corrupted index / performance improvement

2008-02-06 Thread Mark Miller

Hey DM,

Just to recap an earlier thread, you need the sync and you need hardware 
that doesn't lie to you about the result of the sync.


Here is an excerpt about Digg running into that issue:

They had problems with their storage system telling them writes were on 
disk when they really weren't. Controllers do this to improve the 
appearance of their performance. But what it does is leave a giant data 
integrity hole in failure scenarios. This is really a pretty common 
problem and can be hard to fix, depending on your hardware setup.


There is a lot of good stuff relating to this in the discussion 
surrounding the JIRA issue.


robert engels wrote:
That doesn't help, with lazy writing/buffering by the OS, there is no 
guarantee that if the last written block is ok, that earlier blocks in 
the file are


The OS/drive is going to physically write them in the most efficient 
manner. Only after a sync would this hold true (which is what we are 
trying to avoid).


On Feb 6, 2008, at 5:15 PM, DM Smith wrote:



On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote:



robert engels wrote:

Do we have any way of determining if a segment is definitely 
OK/VALID ?


The only way I know is the CheckIndex tool, and it's rather slow (and
it's not clear that it always catches all corruption).


Just a thought. It seems that the discussion has revolved around 
whether a crash or similar event has left the file in an inconsistent 
state. Without looking into how it is actually done, I'm going to 
guess that the writing is done from the start of the file to its end. 
That is, no out of order writing.


If this is the case, how about adding a marker to the end of the file 
of a known size and pattern. If it is present then it is presumed 
that there were no errors in getting to that point.


Even with out of order writing, one could write an 'INVALID' marker 
at the beginning of the operation and then upon reaching the end of 
the writing, replace it with the valid marker.


If neither marker is found then the index is one from before the 
capability was added and nothing can be said about the validity.


-- DM

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:

 I assume that Google also has distributed index over their
 GFS/MapReduce implementation. Any idea how they achieve this?

 J.D.



[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Nigel Daley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566406#action_12566406
 ] 

Nigel Daley commented on LUCENE-1157:
-

{quote}
job/Lucene-trunk/ws/ sounds like a temporary work space, that might be erased 
during builds
{quote}

Yup, that's exactly what it is.

I've updated the Lucene-trunk build to grab trunk/build/docs/changes/* at the end 
of the build and save them as artifacts.

 Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
 read by Lucene users)
 -

 Key: LUCENE-1157
 URL: https://issues.apache.org/jira/browse/LUCENE-1157
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doron Cohen
Assignee: Doron Cohen
 Fix For: 2.4

 Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
 lucene-1157.patch


 Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
One main focus is to provide fault tolerance in this distributed index
system. Correct me if I'm wrong, but I think SOLR-303 is focusing on merging
results from multiple shards right now. We'd like to start an open source
project for a fault-tolerant distributed index system (or join one if it
already exists) if there is enough interest. Making Solr work on top of such
a system could be an important goal, and SOLR-303 is a big part of it in that
case.

I should have made it clear that disjoint data sets are not a requirement of
the system.


On Feb 6, 2008 12:57 PM, Ian Holsman [EMAIL PROTECTED] wrote:

 Hi.
 AOL has a couple of projects going on in the lucene/hadoop/solr space,
 and we will be pushing more stuff out as we can. We don't have anything
 going with solr over hadoop at the moment.

 I'm not sure if this would be better than what SOLR-303 does, but you
 should have a look at the work being done there.

 One of the things you mentioned is that the data sets are disjoint.
 SOLR-303 doesn't require this, and allows us to have a document stored
 in multiple shards (with different caching/update characteristics).




Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Andrzej Bialecki

(trimming excessive cc-s)

Ning Li wrote:

No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:


I assume that Google also has distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this?


I'm pretty sure that MapReduce/GFS/BigTable is used only for creating 
the index (as well as crawling, data mining, web graph analysis, static 
scoring etc). The overhead of MR jobs is just too high.


Their impressive search response times are most likely the result of 
extensive caching of pre-computed partial hit lists for frequent terms 
and phrases - at least that's what I suspect after reading this paper 
(not by Google folks, but very enlightening): 
http://citeseer.ist.psu.edu/724464.html


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I'm pretty sure that what you describe is the case, especially taking into
consideration that PageRank (what drives their search results) is a
per-document value that is probably recomputed after some long time interval. I
did see a MapReduce algorithm to compute PageRank as well. However, I do
think they must be distributing the query load across many, many machines.

I also think that limiting flat results to the top 10 and then doing paging is
optimized for performance. That is yet another reason why Google has not
implemented faceted browsing or real-time clustering around their result set.

J.D.

On Feb 6, 2008 4:22 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 (trimming excessive cc-s)

 Ning Li wrote:
  No. I'm curious too. :)
 
  On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:
 
  I assume that Google also has a distributed index over their
  GFS/MapReduce implementation. Any idea how they achieve this?

 I'm pretty sure that MapReduce/GFS/BigTable is used only for creating
 the index (as well as crawling, data mining, web graph analysis, static
 scoring etc). The overhead of MR jobs is just too high.

 Their impressive search response times are most likely the result of
 extensive caching of pre-computed partial hit lists for frequent terms
 and phrases - at least that's what I suspect after reading this paper
 (not by Google folks, but very enlightening):
 http://citeseer.ist.psu.edu/724464.html

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: detected corrupted index / performance improvement

2008-02-06 Thread DM Smith


On Feb 6, 2008, at 6:42 PM, Mark Miller wrote:


Hey DM,

Just to recap an earlier thread, you need the sync and you need  
hardware that doesn't lie to you about the result of the sync.


Here is an excerpt about Digg running into that issue:

They had problems with their storage system telling them writes  
were on disk when they really weren't. Controllers do this to  
improve the appearance of their performance. But what it does is  
leave a giant data integrity hole in failure scenarios. This is  
really a pretty common problem and can be hard to fix, depending on  
your hardware setup.


There is a lot of good stuff relating to this in the discussion  
surrounding the JIRA issue.
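
(For anyone following along: the "sync" in question is just the plain
file-descriptor flush. A minimal sketch, with a made-up file name, looks
like this; the warning above is that a controller may acknowledge the
sync() before the bytes are really on the platter.)

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public final class SyncSketch {
  // Write some bytes and ask the OS to flush them to stable storage.
  public static void writeAndSync(byte[] data) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("segments.dat"), "rw");
    try {
      raf.write(data);                 // may still sit in OS buffers
      raf.getFD().sync();              // request a flush to the device
      // raf.getChannel().force(true); // NIO equivalent
    } finally {
      raf.close();
    }
  }
}
{code}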


I guess I can take that dull tool out of my tool box. :(

BTW, I followed the thread and the Jira discussion, but I missed that.




robert engels wrote:
That doesn't help. With lazy writing/buffering by the OS, there is  
no guarantee that, if the last written block is OK, the earlier  
blocks in the file are as well.


The OS/drive is going to physically write them in the most  
efficient manner. Only after a sync would this hold true (which is  
what we are trying to avoid).


On Feb 6, 2008, at 5:15 PM, DM Smith wrote:



On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote:



robert engels wrote:

Do we have any way of determining if a segment is definitely OK/ 
VALID ?


The only way I know is the CheckIndex tool, and it's rather slow  
(and

it's not clear that it always catches all corruption).


Just a thought. It seems that the discussion has revolved around  
whether a crash or similar event has left the file in an  
inconsistent state. Without looking into how it is actually done,  
I'm going to guess that the writing is done from the start of the  
file to its end. That is, no out of order writing.


If this is the case, how about adding a marker to the end of the  
file of a known size and pattern. If it is present then it is  
presumed that there were no errors in getting to that point.


Even with out of order writing, one could write an 'INVALID'  
marker at the beginning of the operation and then upon reaching  
the end of the writing, replace it with the valid marker.


If neither marker is found then the index is one from before the  
capability was added and nothing can be said about the validity.


-- DM
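
A rough sketch of the second variant (invented class name and marker
values, not existing Lucene code) would write an INVALID marker first and
overwrite it with VALID only after the last byte has gone out:

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public final class MarkerProtectedFile {
  private static final byte[] VALID   = "LUCENEOK".getBytes();
  private static final byte[] INVALID = "LUCENEXX".getBytes();

  // Write a provisional INVALID marker, stream the contents, then replace
  // the marker with VALID once everything before it has succeeded.
  public static void write(RandomAccessFile out, byte[] contents) throws IOException {
    out.seek(0);
    out.write(INVALID);
    out.write(contents);
    out.seek(0);
    out.write(VALID);
  }

  // Returns true only if the marker says the file was written completely.
  public static boolean looksComplete(RandomAccessFile in) throws IOException {
    byte[] header = new byte[VALID.length];
    in.seek(0);
    if (in.read(header) != header.length) {
      return false;
    }
    return Arrays.equals(header, VALID);
  }
}
{code}

As the objection quoted earlier in this message notes, this only proves
anything if the bytes actually reached the disk in order, which is exactly
the sync question being debated.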

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: detected corrupted index / performance improvement

2008-02-06 Thread Andrew Zhang
On Feb 7, 2008 7:22 AM, robert engels [EMAIL PROTECTED] wrote:

 That doesn't help. With lazy writing/buffering by the OS, there is no
 guarantee that, if the last written block is OK, the earlier blocks
 in the file are as well.

 The OS/drive is going to physically write them in the most efficient
 manner. Only after a sync would this hold true (which is what we are
 trying to avoid).


Hi, how about an asynchronous commit? I.e., use a thread to sync the data.

We only need to make sure that all data are written to storage before
the next operation, right?
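
As an illustration of what that could look like (a sketch only, not
existing Lucene code), a background thread could drain a queue of file
descriptors and sync them off the indexing path:

{code:java}
import java.io.FileDescriptor;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BackgroundSyncer implements Runnable {
  private final BlockingQueue<FileDescriptor> pending =
      new LinkedBlockingQueue<FileDescriptor>();

  // Writers hand over descriptors; the sync happens off the critical path.
  public void submit(FileDescriptor fd) {
    pending.add(fd);
  }

  public void run() {
    try {
      while (true) {
        FileDescriptor fd = pending.take();      // block for next request
        try {
          fd.sync();                             // flush outside the indexing thread
        } catch (IOException e) {
          // a real implementation would report this back to the writer
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();        // allow clean shutdown
    }
  }
}
{code}

Callers that need durability before the next operation would still have to
wait for the queue to drain, which is where the cost comes back.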



 On Feb 6, 2008, at 5:15 PM, DM Smith wrote:

 
  On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote:
 
 
  robert engels wrote:
 
  Do we have any way of determining if a segment is definitely OK/
  VALID ?
 
  The only way I know is the CheckIndex tool, and it's rather slow (and
  it's not clear that it always catches all corruption).
 
  Just a thought. It seems that the discussion has revolved around
  whether a crash or similar event has left the file in an
  inconsistent state. Without looking into how it is actually done,
  I'm going to guess that the writing is done from the start of the
  file to its end. That is, no out of order writing.
 
  If this is the case, how about adding a marker to the end of the
  file of a known size and pattern. If it is present then it is
  presumed that there were no errors in getting to that point.
 
  Even with out of order writing, one could write an 'INVALID' marker
  at the beginning of the operation and then upon reaching the end of
  the writing, replace it with the valid marker.
 
  If neither marker is found then the index is one from before the
  capability was added and nothing can be said about the validity.
 
  -- DM
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




-- 
Best regards,
Andrew Zhang

db4o - database for Android: www.db4o.com
http://zhanghuangzhu.blogspot.com/


Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
That is the problem: waiting for the full sync (of all of the segment  
files) takes quite a while... syncing a single log file is much more  
efficient.
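
A sketch of that single-log-file idea (illustrative only; names invented):
append every update record to one transaction log and sync just that file,
instead of fsync-ing each segment file.

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TransactionLog {
  private final RandomAccessFile log;

  public TransactionLog(File file) throws IOException {
    log = new RandomAccessFile(file, "rw");
    log.seek(log.length());              // always append at the end
  }

  // One record, one sync: cheaper than syncing every segment file.
  public synchronized void append(byte[] record) throws IOException {
    log.writeInt(record.length);         // length-prefix the record
    log.write(record);
    log.getFD().sync();
  }

  public synchronized void close() throws IOException {
    log.close();
  }
}
{code}

On recovery the log would be replayed, or truncated at the first incomplete
record, which is the usual trade-off such schemes make.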


On Feb 6, 2008, at 9:41 PM, Andrew Zhang wrote:


On Feb 7, 2008 7:22 AM, robert engels [EMAIL PROTECTED] wrote:


That doesn't help. With lazy writing/buffering by the OS, there is no
guarantee that, if the last written block is OK, the earlier blocks
in the file are as well.

The OS/drive is going to physically write them in the most efficient
manner. Only after a sync would this hold true (which is what we are
trying to avoid).



Hi, how about an asynchronous commit? I.e., use a thread to sync the data.

We only need to make sure that all data are written to storage before
the next operation, right?




On Feb 6, 2008, at 5:15 PM, DM Smith wrote:



On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote:



robert engels wrote:


Do we have any way of determining if a segment is definitely OK/
VALID ?


The only way I know is the CheckIndex tool, and it's rather slow  
(and

it's not clear that it always catches all corruption).


Just a thought. It seems that the discussion has revolved around
whether a crash or similar event has left the file in an
inconsistent state. Without looking into how it is actually done,
I'm going to guess that the writing is done from the start of the
file to its end. That is, no out of order writing.

If this is the case, how about adding a marker to the end of the
file of a known size and pattern. If it is present then it is
presumed that there were no errors in getting to that point.

Even with out of order writing, one could write an 'INVALID' marker
at the beginning of the operation and then upon reaching the end of
the writing, replace it with the valid marker.

If neither marker is found then the index is one from before the
capability was added and nothing can be said about the validity.

-- DM

 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
Best regards,
Andrew Zhang

db4o - database for Android: www.db4o.com
http://zhanghuangzhu.blogspot.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-06 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566447#action_12566447
 ] 

Steven Rowe commented on LUCENE-1157:
-

Okay - it's available now at: 
[http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/trunk/build/docs/changes/Changes.html]

Wow, that's a looong URL.  Can we shorten that at all?  E.g.:

http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/changes/Changes.html


 Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
 read by Lucene users)
 -

 Key: LUCENE-1157
 URL: https://issues.apache.org/jira/browse/LUCENE-1157
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doron Cohen
Assignee: Doron Cohen
 Fix For: 2.4

 Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
 lucene-1157.patch


 Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-06 Thread Thomas Peuss
Hello Steven!

Steven Rowe (JIRA) wrote:
 
 Also, I don't see Swedish among the hyphenation data licenses - is it covered 
 in some other way?
 

I have a Swedish grammar file now. If you are interested, drop me a note.
It is not that hard to generate them from the TeX files.

CU
Thomas

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]