[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Peuss updated LUCENE-1166:
---------------------------------
    Attachment: de.xml

A hyphenation grammar. You can download them from:
http://downloads.sourceforge.net/offo/offo-hyphenation.zip?modtime=1168687306&big_mirror=0

> A tokenfilter to decompose compound words
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch, de.xml
>
> [issue description snipped; see the full text in the "Created" message below]
[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Peuss updated LUCENE-1166:
---------------------------------
    Attachment: hyphenation.dtd

The DTD describing the hyphenation grammar XML files.

> A tokenfilter to decompose compound words
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
> [issue description snipped]
[jira] Created: (LUCENE-1166) A tokenfilter to decompose compound words
A tokenfilter to decompose compound words
-----------------------------------------

                Key: LUCENE-1166
                URL: https://issues.apache.org/jira/browse/LUCENE-1166
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Analysis
           Reporter: Thomas Peuss
        Attachments: CompoundTokenFilter.patch

A tokenfilter to decompose compound words found in many Germanic languages (like German, Swedish, ...) into single tokens. An example: "Donaudampfschiff" would be decomposed to "Donau", "dampf", "schiff", so that you can find the word even when you only enter "Schiff".

I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. Currently I use the FOP jars directly; I only use a handful of classes from the FOP project.

My question now: would it be OK to copy these classes over to the Lucene project (renaming the packages, of course), or should I stick with the dependency on the FOP jars? The FOP code uses the ASF V2 license as well. What do you think?
[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Peuss updated LUCENE-1166:
---------------------------------
    Attachment: CompoundTokenFilter.patch

A preliminary version of the token filter.

> A tokenfilter to decompose compound words
>                 Key: LUCENE-1166
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch
>
> [issue description snipped]
Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by DoronCohen
Hey Doron,

I see you recommend that we think about making SweetSpot the default similarity. Do you have numbers for running that alone? Or, for that matter, any of the other combinations of #3 individually?

Thanks,
Grant

On Jan 31, 2008, at 4:09 AM, Doron Cohen wrote:

> Hi Otis,
>
> On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
>> Doron - this looks super useful! Can you give an example for the lexical affinities you mention here? (Juru creates posting lists for lexical affinities)
>
> Sure - simply put, denote {X} as the posting list of term X. Then for a query - A B C D - in addition to the four posting lists {A}, {B}, {C}, {D}, which are processed ignoring position info (i.e. Lucene's termDocs()), Juru also computes combined posting lists {A,B}, {A,C}, {A,D}, {B,C}, {B,D} and {C,D}, in which a (virtual) term {X,Y} is said to exist in a document D if the two words X and Y are found in that document within a sliding window of size L (say 5). (You can also require LAs to be in order, which is useful in some scenarios.) Juru's tokenization detects sentences, so the two words must be in the same sentence.
>
> The term-freq of that LA-term in the doc is, as usual, the number of matches in that doc satisfying this sliding-window rule. The IDF of this term is not known in advance, so it is first estimated based on the DF of X and Y, and this estimate is later tuned as more documents are processed and more statistics are available.
>
> You can see the resemblance to SpanNear queries. Note that the IDF of this virtual term is going to be high, and as such it focuses the query search on the more relevant documents.
>
> In my Lucene implementation of this I used a window size of 7, and note that (1) there was no sentence-boundary knowledge in my Lucene implementation, and (2) the IDF was fixed all along, estimated from the involved terms' IDFs as computed once in the SpanNear query. The default computation is their sum. This is in most cases too low an IDF, I think. Phrase query, btw, behaves the same. So in both cases (Phrase, Span) I think it would be interesting to experiment with an adaptive IDF computation that updates the IDF as more documents are processed. When the query is made of only a single span or only a single phrase element this is a waste of time. But when the query is more complex (as the query we built) and you have in the query both multi-term parts and single-term parts, or several multi-term parts, then a more accurate IDF could improve the quality, I would think. Implementation-wise, the Weight.value would need to be updated, and this might raise questions about the normalization of other query parts, but I am not sure about this now.
>
> Well, I hope this makes sense - I will update the Wiki page with similar info...
>
>> Also: Normalized term-frequency, as in Juru. Here, tf(freq) is normalized by the average term frequency of the document. I've never seen this mentioned anywhere except here and once here on the ML (was it you who mentioned this?), but this sounds intuitive.
>
> Yes, I think I mentioned this - and I think it is not our idea: Juru uses it, but it was used before in the SMART system - see "Length Normalization in Degraded Text Collections" (1995) - http://citeseer.ist.psu.edu/100699.html , and "New Retrieval Approaches Using SMART: TREC 4" - http://citeseer.ist.psu.edu/144841.html.
>
> What do others think?
>
> Otis

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
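[For anyone who wants to play with the SpanNear analogy Doron draws, here is a minimal sketch of a lexical-affinity-style query. The field name "body" is made up; slop=7 and inOrder=false mirror the window size and unordered matching described above, though SpanNearQuery's slop is only a rough analogue of Juru's sliding window.]

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class LexicalAffinityQueries {
      // Approximate the virtual term {x,y}: both words within a window
      // of 7 positions of each other, in any order.
      public static SpanNearQuery affinity(String field, String x, String y) {
        SpanQuery[] pair = new SpanQuery[] {
            new SpanTermQuery(new Term(field, x)),
            new SpanTermQuery(new Term(field, y))
        };
        return new SpanNearQuery(pair, 7, false); // slop=7, unordered
      }
    }

[As Doron notes, the IDF such a query gets by default is the sum of the two terms' IDFs, computed once up front.]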
[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566163#action_12566163 ]

Steven Rowe commented on LUCENE-1157:
-------------------------------------

If I browse to [http://hudson.zones.apache.org/hudson/job/Lucene-trunk/changes/], or anything in that directory, including Changes.html, I see a Hudson page dedicated to per-nightly-build commits. Nice page :).

I'm guessing what's going on is a namespace issue: on the Hudson server, anything you put into {{Lucene-trunk/changes/}} is unlinkable-to, because that directory is dedicated to the Hudson Changes page. Fixing this may be as simple as changing the name of the target directory, maybe to {{official-changes/}} or something like that.

> Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
>                 Key: LUCENE-1157
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1157
>            Project: Lucene - Java
>         Issue Type: Improvement
>         Components: Website
>           Reporter: Doron Cohen
>           Assignee: Doron Cohen
>            Fix For: 2.4
>        Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, lucene-1157.patch
>
> Background in http://www.nabble.com/formatable-changes-log-tt15078749.html
[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566167#action_12566167 ]

Doron Cohen commented on LUCENE-1157:
-------------------------------------

I suspected something like this but wasn't sure... OK, I'll rename the directory and then we'll see.

> Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
>                 Key: LUCENE-1157
> [issue details snipped]
[jira] Updated: (LUCENE-997) Add search timeout support to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-997:
-------------------------------
    Attachment: timeout.patch

Attached patch corrects the default-resolution comment.

> Add search timeout support to Lucene
>                 Key: LUCENE-997
>                 URL: https://issues.apache.org/jira/browse/LUCENE-997
>            Project: Lucene - Java
>         Issue Type: New Feature
>           Reporter: Sean Timm
>           Priority: Minor
>        Attachments: HitCollectorTimeoutDecorator.java, LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java
>
> This patch is based on Nutch-308. It adds support for a maximum search time limit. After this time is exceeded, the search thread is stopped, partial results (if any) are returned, and the total number of results is estimated. The patch tries to minimize the time-keeping overhead by using a safe, unsynchronized timer. This was also discussed in an e-mail thread: http://www.nabble.com/search-timeout-tf3410206.html#a9501029
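[As a rough illustration of the decorator idea behind the attached HitCollectorTimeoutDecorator.java - this is not the patch code; the class below is a made-up sketch, and for simplicity it checks the wall clock on every hit instead of using the patch's coarse-resolution timer thread:]

    import org.apache.lucene.search.HitCollector;

    public class TimeLimitedCollector extends HitCollector {
      private final HitCollector delegate;
      private final long deadline; // absolute wall-clock time, in ms

      public TimeLimitedCollector(HitCollector delegate, long timeoutMillis) {
        this.delegate = delegate;
        this.deadline = System.currentTimeMillis() + timeoutMillis;
      }

      public void collect(int doc, float score) {
        if (System.currentTimeMillis() > deadline) {
          // Aborts the search; the caller catches this and returns
          // the partial results collected so far.
          throw new RuntimeException("Search timed out");
        }
        delegate.collect(doc, score);
      }
    }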
[jira] Updated: (LUCENE-997) Add search timeout support to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-997:
-------------------------------
    Attachment: timeout.patch

> Add search timeout support to Lucene
>                 Key: LUCENE-997
> [issue details snipped]
[jira] Commented: (LUCENE-997) Add search timeout support to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566175#action_12566175 ]

Doron Cohen commented on LUCENE-997:
------------------------------------

Ah, I wrote that comment before I decided to change the default... Thanks for catching this.

> Add search timeout support to Lucene
>                 Key: LUCENE-997
> [issue details snipped]
[jira] Commented: (LUCENE-997) Add search timeout support to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566171#action_12566171 ]

Sean Timm commented on LUCENE-997:
----------------------------------

Doron, your comment for setResolution(long) says "The default timer resolution is 50 milliseconds"; however, the default is actually 20 ms (public static final int DEFAULT_RESOLUTION = 20;). Other than that, everything looks great.

> Add search timeout support to Lucene
>                 Key: LUCENE-997
> [issue details snipped]
[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566188#action_12566188 ]

Steven Rowe commented on LUCENE-1166:
-------------------------------------

Hi Thomas,

Looking at [http://offo.sourceforge.net/hyphenation/licenses.html], which seems to be the same information as in the offo-hyphenation.zip file you attached to this issue, the license issue may be a problem - the hyphenation data is covered by different licenses on a per-language basis. For example, there are two German data files, and both are licensed under a LaTeX license, as is the Danish file, and these two languages are the most likely targets for your TokenFilter. IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?

> A tokenfilter to decompose compound words
>                 Key: LUCENE-1166
> [issue description snipped]
[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566206#action_12566206 ]

Nigel Daley commented on LUCENE-1157:
-------------------------------------

I suggest you save the Changes.html as one of the build artifacts (just like the tar.gz files are saved). Grant can add this file to the artifacts list in the Hudson configuration screen if you want this done.

> Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
>                 Key: LUCENE-1157
> [issue details snipped]
[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words
[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566220#action_12566220 ]

Thomas Peuss commented on LUCENE-1166:
--------------------------------------

bq. Looking at http://offo.sourceforge.net/hyphenation/licenses.html, [...] IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

This is true. And that's why I uploaded the two files without the ASF license grant. The FOP project does not have the files in its code base either, because of the licensing problem.

bq. Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?

OFFO has no Swedish grammar file. We could generate a Swedish grammar file from the LaTeX grammar files; I'll have a look into this tonight. All other hyphenation implementations I have found so far use the LaTeX files either directly or in a converted variant, like the FOP code does. What we can do, of course, is ask the authors of the LaTeX files if they want to license their work under the ASF license as well. It is worth a try, but I suspect that many email addresses in the LaTeX files are not used anymore. I'll try to contact the authors of the German grammar files tomorrow.

BTW, an example for those who don't want to try the patch:

Input token stream:
  Rindfleischüberwachungsgesetz Drahtschere abba

Output token stream:
  (Rindfleischüberwachungsgesetz,0,29)
  (Rind,0,4,posIncr=0)
  (fleisch,4,11,posIncr=0)
  (überwachung,11,22,posIncr=0)
  (gesetz,23,29,posIncr=0)
  (Drahtschere,30,41)
  (Draht,30,35,posIncr=0)
  (schere,35,41,posIncr=0)
  (abba,42,46)

> A tokenfilter to decompose compound words
>                 Key: LUCENE-1166
> [issue description snipped]
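[To make the mechanics above concrete, here is a sketch of how such a filter can emit the compound first and then its parts at position increment 0. This is not the attached patch: the decompose() step, which the patch backs with FOP's hyphenation code, is left abstract, and the offset arithmetic is naive - the real filter derives offsets from hyphenation points, which is why "gesetz" above starts at 23, not 22.]

    import java.io.IOException;
    import java.util.LinkedList;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public abstract class DecompoundingFilter extends TokenFilter {
      private final LinkedList parts = new LinkedList();

      protected DecompoundingFilter(TokenStream input) {
        super(input);
      }

      // Split a compound into subwords, e.g. via hyphenation points.
      protected abstract String[] decompose(String word);

      public Token next() throws IOException {
        if (!parts.isEmpty()) {
          return (Token) parts.removeFirst(); // emit buffered subtokens
        }
        Token token = input.next();
        if (token == null) {
          return null;
        }
        String[] subwords = decompose(token.termText());
        if (subwords.length > 1) {
          int offset = token.startOffset();
          for (int i = 0; i < subwords.length; i++) {
            Token part = new Token(subwords[i], offset, offset + subwords[i].length());
            part.setPositionIncrement(0); // stack the part on the compound
            parts.add(part);
            offset += subwords[i].length();
          }
        }
        return token; // the compound itself comes first
      }
    }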
Lucene-based Distributed Index Leveraging Hadoop
There have been several proposals for a Lucene-based distributed index architecture.
 1) Doug Cutting's "Index Server Project Proposal" at
    http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
 2) Solr's "Distributed Search" at
    http://wiki.apache.org/solr/DistributedSearch
 3) Mark Butler's "Distributed Lucene" at
    http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture. Our design differs from the above proposals in the way it leverages Hadoop as much as possible. In particular, HDFS is used to reliably store Lucene instances, Map/Reduce is used to analyze documents and update Lucene instances in parallel, and Hadoop's IPC framework is used. Our design is geared toward applications that require a highly scalable index and where batch updates to each Lucene instance are acceptable (versus finer-grained document-at-a-time updates).

We have a working implementation of our design and are in the process of evaluating its performance. An overview of our design is provided below. We welcome feedback and would like to know if you are interested in working on it. If so, we would be happy to make the code publicly available. At the same time, we would like to collaborate with people working on existing proposals and see if we can consolidate our efforts.

TERMINOLOGY
A distributed index is partitioned into "shards". Each shard corresponds to a Lucene instance and contains a disjoint subset of the documents in the index. Each shard is stored in HDFS and served by one or more "shard servers". Here we only talk about a single distributed index, but in practice multiple indexes can be supported.

A "master" keeps track of the shard servers and the shards being served by them. An application updates and queries the global index through an "index client". An index client communicates with the shard servers to execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the discussion, some of their parameters have been omitted.

  On the Shard Servers
    // Execute a query on this shard server's Lucene instance.
    // This method is called by an index client.
    SearchResults search(Query query);

  On the Master
    // Tell the master to update the shards, i.e., Lucene instances.
    // This method is called by an index client.
    boolean updateShards(Configuration conf);

    // Ask the master where the shards are located.
    // This method is called by an index client.
    LocatedShards getShardLocations();

    // Send a heartbeat to the master. This method is called by a
    // shard server. In the response, the master informs the
    // shard server when to switch to a newer version of the index.
    ShardServerCommand sendHeartbeat();

QUERYING THE INDEX
To query the index, an application sends a search request to an index client. The index client then calls the shard server search() method for each shard of the index, merges the results and returns them to the application. The index client caches the mapping between shards and shard servers by periodically calling the master's getShardLocations() method.

UPDATING THE INDEX USING MAP/REDUCE
To update the index, an application sends an update request to an index client. The index client then calls the master's updateShards() method, which schedules a Map/Reduce job to update the index. The Map/Reduce job updates the shards in parallel and copies the new index files of each shard (i.e., Lucene instance) to HDFS. The updateShards() method includes a configuration, which provides information for updating the shards.
More specifically, the configuration includes the following information:
 - Input path. This provides the location of updated documents, e.g., HDFS files or directories, or HBase tables.
 - Input formatter. This specifies how to format the input documents.
 - Analysis. This defines the analyzer to use on the input. The analyzer determines whether a document is being inserted, updated, or deleted. For inserts or updates, the analyzer also converts each input document into a Lucene document.

The Map phase of the Map/Reduce job formats and analyzes the input (in parallel), while the Reduce phase collects and applies the updates to each Lucene instance (again in parallel). The updates are applied using the local file system where a Reduce task runs and then copied back to HDFS. For example, if the updates caused a new Lucene segment to be created, the new segment would be created on the local file system first, and then copied back to HDFS.

When the Map/Reduce job completes, a new version of the index is ready to be queried. It is important to note that the new version of the index is not derived from scratch. By leveraging Lucene's update algorithm, the new version of each Lucene instance will share as many files as possible with the previous version.

ENSURING INDEX CONSISTENCY
At any point in time, an index client always has
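[Declared as plain Java, the RPC surface described above might look as follows. This is a sketch only: the empty marker classes stand in for types from the not-yet-released implementation, and the Hadoop IPC wiring is omitted.]

    // Stand-ins for types from the implementation described above.
    class Query implements java.io.Serializable {}
    class SearchResults implements java.io.Serializable {}
    class Configuration implements java.io.Serializable {}
    class LocatedShards implements java.io.Serializable {}
    class ShardServerCommand implements java.io.Serializable {}

    // Implemented by each shard server; called by index clients.
    interface ShardServerProtocol {
      SearchResults search(Query query);
    }

    // Implemented by the master.
    interface MasterProtocol {
      boolean updateShards(Configuration conf); // called by an index client
      LocatedShards getShardLocations();        // called by an index client
      ShardServerCommand sendHeartbeat();       // called by a shard server
    }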
Re: Lucene-based Distributed Index Leveraging Hadoop
I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any idea how they achieve this?

J.D.

On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote:

> There seem to be a few other players in this space too. Are you from Rackspace?
> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>
> AOL also has a Hadoop/Solr project going on. CNET does not have much brewing there. Although Yonik and I had talked about it a bunch -- but that was long ago.
>
> --cw
>
> Clay Webster                              tel:1.908.541.3724
> Associate VP, Platform Infrastructure     http://www.cnet.com
> CNET, Inc. (Nasdaq:CNET)                  mailto:[EMAIL PROTECTED]
>
> -----Original Message-----
> From: Ning Li [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, February 06, 2008 1:57 PM
> To: [EMAIL PROTECTED]; java-dev@lucene.apache.org; solr-[EMAIL PROTECTED]
> Subject: Lucene-based Distributed Index Leveraging Hadoop
>
> [full proposal snipped - see "Lucene-based Distributed Index Leveraging Hadoop" above]
Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by DoronCohen
Hi Grant, yes, I have these combinations - I just updated the wiki page with these numbers. I still have the index as described, allowing me to try other ideas that may come up, or to run more tests (on GOV2 data) if we need them to make better decisions...

Cheers,
Doron

On Wed, Feb 6, 2008 at 2:15 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:

> Hey Doron,
>
> I see you recommend that we think about making SweetSpot the default similarity. Do you have numbers for running that alone? Or, for that matter, any of the other combinations of #3 individually?
>
> Thanks,
> Grant
>
> [earlier thread snipped - see the messages above]
Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by DoronCohen
On Thu, Jan 31, 2008 at 11:09 AM, Doron Cohen [EMAIL PROTECTED] wrote:

> Sure - simply put, denote {X} as the posting list of term X. Then for a query - A B C D - in addition to the four posting lists {A}, {B}, {C}, {D}, which are processed ignoring position info (i.e. Lucene's termDocs()), Juru also computes combined posting lists {A,B}, {A,C}, {A,D}, {B,C}, {B,D} and {C,D}, in which a (virtual) term {X,Y} is said to exist in a document D if the two words X and Y are found in that document within a sliding window of size L (say 5).

The wiki page now has a more complete example.

> The term-freq of that LA-term in the doc is, as usual, the number of matches in that doc satisfying this sliding window rule. The IDF of this term is not known in advance, and so it is first estimated based on the DF of X and Y, and this estimate is later tuned as more documents are processed and more statistics are available.

This was not so accurate a description. What Juru really does is compute in advance the first, e.g., 1MB of the LA posting and use its computed IDF for the entire posting. Experiments with more accurate adaptive computation (for longer LA postings) showed no advantage over this simpler approach.

> In my Lucene implementation for this I used a window size of 7 [...] So in both cases (Phrase, Span) I think it would be interesting to experiment with adaptive IDF computation that updates the IDF as more documents are processed. [...] Implementation-wise, the Weight.value would need to be updated and might raise questions about the normalizing of other query parts, but I am not sure about this now.

Well, after discussing this with my colleague David Carmel, who pointed out that summing the IDFs actually makes sense: each IDF is *nearly* a log of nDocs/DF, and so summing the nearly-logs is (nearly) the log of the multiplication (of the (1+nDocs/DF) terms). So I no longer see here a problem to fix, or an immediate opportunity to explore...
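[Spelling out David Carmel's point, using the simplified idf(t) = log(N/df_t); Lucene's actual formula, 1 + log(N/(df_t+1)), differs only by the constants, hence all the "nearly"s:

    idf(X) + idf(Y) = log(N/df_X) + log(N/df_Y)
                    = log( N / (df_X * df_Y / N) )

so the summed IDF already behaves like the IDF of a joint term whose document frequency is estimated as df_X*df_Y/N, i.e. under an independence assumption between X and Y.]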
Re: [Lucene-java Wiki] Update of TREC 2007 Million Queries Track - IBM Haifa Team by PaulElschot
Oh well, I ticked the "remove trailing white space" box. The only real addition is at the end:

  * Easier and more efficient ways to add proximity scoring?
  + For example, specialize Span-Near-Query for the case when all subqueries are terms.

Regards,
Paul Elschot
[jira] Created: (LUCENE-1167) add compatibility statement to README.txt for all contribs
add compatibility statement to README.txt for all contribs
-----------------------------------------------------------

                Key: LUCENE-1167
                URL: https://issues.apache.org/jira/browse/LUCENE-1167
            Project: Lucene - Java
         Issue Type: Task
         Components: contrib/*
           Reporter: Hoss Man
            Fix For: 2.9

As discussed on the mailing list, not all contribs are created equal, and we should include comments about the backwards compatibility of each contrib in its README.txt before the next release.

http://www.nabble.com/Back-Compatibility-to14918202.html#a14918202
Re: detected corrupted index / performance improvement
robert engels wrote:
> Do we have any way of determining if a segment is definitely OK/VALID?

The only way I know is the CheckIndex tool, and it's rather slow (and it's not clear that it always catches all corruption).

> If so, a much more efficient transactional system could be developed.
>
> Serialize the updates to a log file. Sync the log. Update the Lucene index WITHOUT any sync. Log file writing/sync is VERY efficient since it is sequential, and a single file.
>
> Upon open of the index, detect if the index was not shut down cleanly. If so, determine the last valid segment, delete the bad segments, and then perform the updates (from the log file) since the last valid segment was written.
>
> The detection could be a VERY slow operation, but this is OK, since it should be rare, and then you will only pay this price on the rare occasion, not on every update.

Wouldn't you still need to sync periodically, so you can prune the transaction log? Else your transaction log is growing as fast as the index? (You've doubled disk usage.)

Mike
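[A minimal sketch of the log-then-apply idea robert describes - all names and the record format here are invented; the point is that only the single sequential log file is ever forced to disk:]

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Append-only update log: each record is forced to disk before the
    // corresponding (unsynced) Lucene update proceeds.
    public class UpdateLog {
      private final RandomAccessFile log;

      public UpdateLog(File path) throws IOException {
        log = new RandomAccessFile(path, "rw");
        log.seek(log.length()); // append
      }

      public void append(byte[] record) throws IOException {
        log.writeInt(record.length);
        log.write(record);
        log.getChannel().force(false); // one sequential file: a cheap sync
      }
    }

[On open after an unclean shutdown, the recovery step would replay all records appended after the last valid segment was written.]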
Re: detected corrupted index / performance improvement
On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote:
> robert engels wrote:
>> Do we have any way of determining if a segment is definitely OK/VALID?
>
> The only way I know is the CheckIndex tool, and it's rather slow (and it's not clear that it always catches all corruption).

Just a thought. It seems that the discussion has revolved around whether a crash or similar event has left the file in an inconsistent state. Without looking into how it is actually done, I'm going to guess that the writing is done from the start of the file to its end. That is, no out-of-order writing.

If this is the case, how about adding a marker to the end of the file, of a known size and pattern? If it is present, then it is presumed that there were no errors in getting to that point. Even with out-of-order writing, one could write an 'INVALID' marker at the beginning of the operation and then, upon reaching the end of the writing, replace it with the valid marker. If neither marker is found, then the index is from before the capability was added and nothing can be said about its validity.

-- DM
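[DM's marker could look something like this - a sketch only; the magic value is arbitrary, and, as the replies below point out, without a sync the presence of the trailing marker still doesn't prove that earlier blocks reached the disk:]

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ValidityMarker {
      private static final long VALID_MAGIC = 0x4C75636E56414C44L;

      // Append the marker once the file is completely written.
      public static void seal(RandomAccessFile out) throws IOException {
        out.seek(out.length());
        out.writeLong(VALID_MAGIC);
      }

      // A missing marker means the file was never completely written,
      // or predates this convention.
      public static boolean isSealed(File file) throws IOException {
        RandomAccessFile in = new RandomAccessFile(file, "r");
        try {
          if (in.length() < 8) {
            return false;
          }
          in.seek(in.length() - 8);
          return in.readLong() == VALID_MAGIC;
        } finally {
          in.close();
        }
      }
    }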
Re: detected corrupted index / performance improvement
Yes, but this pruning could be more efficient. On a background thread, get the current segment from the segments file, call a system-wide sync (e.g. System.exec(fsync)), and then you can purge the transaction logs for all segments up to that one. Since it is a background operation, you are not blocking the writing of new segments and tx logs.

On Feb 6, 2008, at 4:42 PM, Michael McCandless wrote:
> Wouldn't you still need to sync periodically, so you can prune the transaction log? Else your transaction log is growing as fast as the index? (You've doubled disk usage.)
>
> [rest of quoted exchange snipped]
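[As a sketch, the background pruning robert describes might look like this. Illustrative only: the two helper methods are hypothetical, and Java has no portable system-wide fsync, hence the exec of the Unix sync command standing in for his System.exec(fsync) shorthand:]

    // Background pruner: snapshot the current segment, force everything
    // to disk, then discard transaction logs that are now durable.
    Thread pruner = new Thread(new Runnable() {
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            Thread.sleep(30000); // prune interval
            long segment = currentSegmentGeneration();   // hypothetical
            Runtime.getRuntime().exec("sync").waitFor(); // Unix only
            purgeLogsUpTo(segment);                      // hypothetical
          } catch (InterruptedException e) {
            return;
          } catch (java.io.IOException e) {
            // real code would log and retry
          }
        }
      }
    });
    pruner.setDaemon(true);
    pruner.start();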
Re: detected corrupted index / performance improvement
Hey DM,

Just to recap an earlier thread: you need the sync, and you need hardware that doesn't lie to you about the result of the sync. Here is an excerpt about Digg running into that issue:

  "They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity hole in failure scenarios."

This is really a pretty common problem and can be hard to fix, depending on your hardware setup. There is a lot of good stuff relating to this in the discussion surrounding the JIRA issue.

robert engels wrote:
> That doesn't help - with lazy writing/buffering by the OS, there is no guarantee that if the last written block is ok, the earlier blocks in the file are. The OS/drive is going to physically write them in the most efficient manner. Only after a sync would this hold true (which is what we are trying to avoid).
>
> On Feb 6, 2008, at 5:15 PM, DM Smith wrote:
>> [marker suggestion snipped - see DM's message above]
Re: Lucene-based Distributed Index Leveraging Hadoop
No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:
> I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any idea how they achieve this?
>
> J.D.
[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566406#action_12566406 ]

Nigel Daley commented on LUCENE-1157:
-------------------------------------

{quote}
job/Lucene-trunk/ws/ sounds like a temporary work space, that might be erased during builds
{quote}

Yup, that's exactly what it is. I've updated the Lucene-trunk build to grab trunk/build/docs/changes/* at the end of the build and save them as artifacts.

> Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
>                 Key: LUCENE-1157
> [issue details snipped]
Re: Lucene-based Distributed Index Leveraging Hadoop
One main focus is to provide fault tolerance in this distributed index system. Correct me if I'm wrong, but I think SOLR-303 is focused on merging results from multiple shards right now.

We'd like to start an open source project for a fault-tolerant distributed index system (or join one if it already exists) if there is enough interest. Making Solr work on top of such a system could be an important goal, and SOLR-303 would be a big part of it in that case.

I should have made it clear that disjoint data sets are not a requirement of the system.

On Feb 6, 2008 12:57 PM, Ian Holsman [EMAIL PROTECTED] wrote:
> Hi. AOL has a couple of projects going on in the lucene/hadoop/solr space, and we will be pushing more stuff out as we can. We don't have anything going with solr over hadoop at the moment.
>
> I'm not sure if this would be better than what SOLR-303 does, but you should have a look at the work being done there. One of the things you mentioned is that the data sets are disjoint. SOLR-303 doesn't require this, and allows us to have a document stored in multiple shards (with different caching/update characteristics).
Re: Lucene-based Distributed Index Leveraging Hadoop
(trimming excessive cc-s)

Ning Li wrote:
> No. I'm curious too. :)
>
> On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:
>> I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any idea how they achieve this?

I'm pretty sure that MapReduce/GFS/BigTable is used only for creating the index (as well as crawling, data mining, web graph analysis, static scoring, etc). The overhead of MR jobs is just too high. Their impressive search response times are most likely the result of extensive caching of pre-computed partial hit lists for frequent terms and phrases - at least that's what I suspect after reading this paper (not by Google folks, but very enlightening): http://citeseer.ist.psu.edu/724464.html

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: Lucene-based Distributed Index Leveraging Hadoop
I'm pretty sure that what you describe is the case, especially taking into consideration that PageRank (what drives their search results) is a per-document value that is probably recomputed after some long time interval. I did see a MapReduce algorithm to compute PageRank as well.

However, I do think they must be distributing the query load across many, many machines. I also think that limiting flat results to the top 10 and then doing paging is optimized for performance - yet another reason why Google has not implemented facet browsing or real-time clustering around their result set.

J.D.

On Feb 6, 2008 4:22 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> I'm pretty sure that MapReduce/GFS/BigTable is used only for creating the index (as well as crawling, data mining, web graph analysis, static scoring etc). The overhead of MR jobs is just too high.
>
> [rest of quoted message snipped - see above]
Re: detected corrupted index / performance improvement
On Feb 6, 2008, at 6:42 PM, Mark Miller wrote:
> Hey DM,
>
> Just to recap an earlier thread: you need the sync, and you need hardware that doesn't lie to you about the result of the sync. [...]

I guess I can take that dull tool out of my tool box. :( BTW, I followed the thread and the Jira discussion, but I missed that.

> [rest of quoted thread snipped]

-- DM
Re: detected corrupted index / performance improvement
On Feb 7, 2008 7:22 AM, robert engels [EMAIL PROTECTED] wrote: That doesn't help; with lazy writing/buffering by the OS, there is no guarantee that, if the last written block is OK, the earlier blocks in the file are as well. The OS/drive is going to physically write them in the most efficient order. Only after a sync would this hold true (which is what we are trying to avoid).

Hi, how about an asynchronous commit, i.e. using a thread to sync the data? We would only need to make sure that all data are written to storage before the next operation.

-- Best regards, Andrew Zhang db4o - database for Android: www.db4o.com http://zhanghuangzhu.blogspot.com/
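For what Andrew describes, a minimal Java sketch might look like the following: the expensive fsync is handed to a single background thread, and the caller blocks on the returned Future only when it actually needs the durability guarantee. AsyncSyncer and syncLater are hypothetical names; this is not how Lucene's IndexWriter actually commits.

```java
import java.nio.channels.FileChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of an asynchronous commit: hand the expensive fsync to a
 * background thread and only block when durability is actually needed.
 * Hypothetical names, for illustration only.
 */
public class AsyncSyncer {

    private final ExecutorService syncPool = Executors.newSingleThreadExecutor();

    /** Schedule an fsync of the channel; returns a handle to wait on. */
    public Future<?> syncLater(final FileChannel channel) {
        return syncPool.submit(() -> {
            channel.force(true); // flush data and metadata to the device
            return null;
        });
    }

    public void shutdown() {
        syncPool.shutdown();
    }
}
```

A caller would write its data, call syncLater(channel), and invoke get() on the returned Future just before the next commit point, which matches the "before the next operation" constraint above.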
Re: detected corrupted index / performance improvement
That is the problem: waiting for the full sync (of all of the segment files) takes quite a while, whereas syncing a single log file is much more efficient.

On Feb 6, 2008, at 9:41 PM, Andrew Zhang wrote: Hi, how about an asynchronous commit, i.e. using a thread to sync the data? We would only need to make sure that all data are written to storage before the next operation.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
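Robert's point, that one fsync on a single log file is cheaper than fsyncing every segment file, is essentially a write-ahead-log trade-off. Here is a hypothetical Java sketch of it; CommitLog and logCommit are invented names, and real recovery logic (replaying or discarding segments not covered by the log) is omitted.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Sketch of a single-file commit log: instead of fsyncing every segment
 * file at commit time, append a record of the operation to one log and
 * fsync only that file. Hypothetical illustration, not Lucene's actual
 * commit protocol.
 */
public class CommitLog {

    private final FileOutputStream out;

    public CommitLog(File logFile) throws IOException {
        this.out = new FileOutputStream(logFile, true); // append mode
    }

    /** Append one commit record and durably sync the single log file. */
    public synchronized void logCommit(byte[] record) throws IOException {
        out.write(record);
        out.flush();
        out.getFD().sync(); // one fsync for the log, not one per segment file
    }

    public void close() throws IOException {
        out.close();
    }
}
```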
[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566447#action_12566447 ] Steven Rowe commented on LUCENE-1157: - Okay - it's available now at: [http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/trunk/build/docs/changes/Changes.html] Wow, that's a looong URL. Can we shorten that at all? E.g.: http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/changes/Changes.html

Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users) - Key: LUCENE-1157 URL: https://issues.apache.org/jira/browse/LUCENE-1157 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 2.4 Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, lucene-1157.patch Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words
Hello Steven! Steven Rowe (JIRA) wrote: Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way? I have a Swedish grammar file now. If you are interested, drop me a note. It is not that hard to generate them from the TeX files. CU Thomas

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]