[jira] Updated: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2248:
--

Attachment: LUCENE-2248.patch

Here is my first patch. Please tell me which name should be used for the static 
constant; I used CURRENT_VERSION. Maybe something with "test" in it?

I transformed TestCharArraySet as a demonstration.
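
For illustration only (the actual patch is attached rather than inlined here), the idea 
looks roughly like this, assuming current trunk APIs; the constant name follows the 
comment above and the test body is made up:

{code:java}
// LuceneTestCase.java: one shared constant that every test references
import junit.framework.TestCase;
import org.apache.lucene.util.Version;

public abstract class LuceneTestCase extends TestCase {
  /** Trunk points this at the version under development; the backwards
   *  branch keeps it at the released version, so copied tests need no edits. */
  public static final Version CURRENT_VERSION = Version.LUCENE_31;
}
{code}

{code:java}
// TestCharArraySet.java: refactored to use the constant instead of LUCENE_CURRENT
import org.apache.lucene.analysis.CharArraySet;

public class TestCharArraySet extends LuceneTestCase {
  public void testAdd() {
    CharArraySet set = new CharArraySet(CURRENT_VERSION, 10, true);
    set.add("foo");
    assertTrue(set.contains("foo"));
  }
}
{code}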

> Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, 
> when development for 3.2 starts
> -
>
> Key: LUCENE-2248
> URL: https://issues.apache.org/jira/browse/LUCENE-2248
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Analysis, contrib/*, contrib/analyzers, 
> contrib/benchmark, contrib/highlighter, contrib/spatial, 
> contrib/spellchecker, contrib/wikipedia, Index, Javadocs, Other, 
> Query/Scoring, QueryParser, Search, Store, Term Vectors
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2248.patch
>
>
> A lot of tests for the most-recent functionality in Lucene use 
> Version.LUCENE_CURRENT, which is fine in trunk, where we always want the most 
> recent version without having to change it in later releases.
> The problem is that if we copy these tests to the backwards branch after 3.1 is 
> out and then start to improve analyzers, we will have maintenance hell for the 
> backwards tests, and we lose backwards-compatibility testing for older 
> versions. If we specified a concrete version like LUCENE_31 in our tests, they 
> would work without any changes after moving to backwards!
> To avoid modifying all tests every time a new version comes out (e.g. after 
> switching to 3.2 dev), I propose the following:
> - declare a static final Version TEST_VERSION = Version.LUCENE_CURRENT (or, 
> better, Version.LUCENE_31) in LuceneTestCase(J4).
> - change all tests that use Version.LUCENE_CURRENT, using an Eclipse refactor, 
> to use this constant, and remove the now-unneeded import statements.
> When we then move the tests to backwards, we only have to change one line, 
> depending on how we define this constant:
> - If in trunk LuceneTestCase it's Version.LUCENE_CURRENT, we just change the 
> backwards branch to use the version number of the released version.
> - If trunk already uses the LUCENE_31 constant (I prefer this), we do not 
> need to change backwards at all; instead, when switching version numbers, we 
> just move trunk forward to the next major version (after adding it to the 
> Version enum).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830543#action_12830543
 ] 

Simon Willnauer commented on LUCENE-2248:
-

Patch looks good, Uwe. I think we should reflect its purpose in the name, maybe 
TEST_VERSION_LATEST or TEST_VERSION_CURRENT.

simon 




[jira] Created: (LUCENE-2251) Change contrib tests to use the special LuceneTestCase(J4) constant for the current version used as the matchVersion parameter

2010-02-06 Thread Uwe Schindler (JIRA)
Change contrib tests to use the special LuceneTestCase(J4) constant for the 
current version used as the matchVersion parameter
-

 Key: LUCENE-2251
 URL: https://issues.apache.org/jira/browse/LUCENE-2251
 Project: Lucene - Java
  Issue Type: Sub-task
Reporter: Uwe Schindler
Assignee: Simon Willnauer


Sub-issue for the contrib changes.




[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830547#action_12830547
 ] 

Uwe Schindler commented on LUCENE-2248:
---

Simon: I opened a sub-issue for contrib and assigned it to you!

I will change the constant to TEST_VERSION_CURRENT and then run Eclipse to do the 
refactoring.




[jira] Assigned: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-2248:
-

Assignee: Uwe Schindler




[jira] Updated: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2248:
--

Attachment: LUCENE-2248.patch

Patch with the updated constant name.

Simon, if you like, you can use it as a basis and start with contrib. 




[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830550#action_12830550
 ] 

Simon Willnauer commented on LUCENE-2248:
-

bq. Simon, if you like, you can use it as a basis and start with contrib.
will do...




[jira] Created: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)
stored field retrieve slow
--

 Key: LUCENE-2252
 URL: https://issues.apache.org/jira/browse/LUCENE-2252
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 3.0
Reporter: John Wang


IndexReader.document() on a stored field is rather slow. I did a simple 
multi-threaded test and profiled it:

40+% of the time is spent getting the offset from the index file
30+% of the time is spent reading the count (i.e. the number of fields to load)

Although I ran it on my laptop, where the disk isn't that great, there still 
seems to be much room for improvement, e.g. loading the field index file into 
memory (for a 5M-doc index the extra memory footprint is 20MB, peanuts compared 
to the other stuff being loaded).

On a related note, are there plans to have custom segments as part of the 
flexible indexing feature?




[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830564#action_12830564
 ] 

Robert Muir commented on LUCENE-2252:
-

John, couldn't you simply write your own Directory if you want to put the fdx in 
RAM? I am not sure about 'peanuts'; some people may not want to pay 8 bytes/doc, 
or whatever it is, for this stored-field offset when the memory could be put to 
better use elsewhere.





[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830582#action_12830582
 ] 

Uwe Schindler commented on LUCENE-2252:
---

FileSwitchDirectory comes to mind. Just delegate the *.fdx extension to a 
RAMDirectory. On instantiation of the directory, create the copy while wrapping 
with FileSwitchDirectory.
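
A rough sketch of what that could look like against the 3.0 store API (not from 
any patch; the class name and index path handling are made up). For a read-only 
reader this avoids the disk seek on the .fdx lookup; note that any newly written 
.fdx files would then live only in RAM:

{code:java}
import java.io.File;
import java.util.Collections;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.FileSwitchDirectory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMDirectory;

public class FdxInRam {
  public static Directory wrap(File indexDir) throws Exception {
    Directory fsDir = FSDirectory.open(indexDir);
    RAMDirectory ramDir = new RAMDirectory();

    // copy the existing stored-field index files (*.fdx) into RAM
    for (String file : fsDir.listAll()) {
      if (file.endsWith(".fdx")) {
        IndexInput in = fsDir.openInput(file);
        IndexOutput out = ramDir.createOutput(file);
        byte[] buf = new byte[(int) in.length()];
        in.readBytes(buf, 0, buf.length);
        out.writeBytes(buf, buf.length);
        out.close();
        in.close();
      }
    }

    // files with the "fdx" extension are served from ramDir,
    // everything else from the FSDirectory
    return new FileSwitchDirectory(
        Collections.singleton("fdx"), ramDir, fsDir, true);
  }
}
{code}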




[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830599#action_12830599
 ] 

John Wang commented on LUCENE-2252:
---

Thanks, Uwe, for the pointer. Will check that out!

Robert, we can get away with 4 bytes per doc assuming we are not storing 2GB of 
data per doc. This memory would be less than the data structure that has to be 
held in memory for a single field cache entry used for sorting. I understand it 
is always better to use less memory, but sometimes we do have to make trade-off 
decisions.
But you are right, different applications have different needs/requirements, so 
having support for custom segments would be a good thing, e.g. LUCENE-1914.




[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830603#action_12830603
 ] 

Yonik Seeley commented on LUCENE-2252:
--

The thing about stored fields is that they're normally not inner-loop stuff.  The 
index may be 100M documents, but the average application pages through hits a 
handful at a time.  And when loading stored fields gets really slow, it tends 
to be because of OS cache misses due to the index being large.  We should still 
optimize it if we can, of course (some apps do access many fields at once), but 
I agree with Robert that a direct in-memory stored-field index probably 
wouldn't be a good default.

John, do you have a specific use case where this is the bottleneck, or are you 
just looking for places to optimize in general?




[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830615#action_12830615
 ] 

Robert Muir commented on LUCENE-2252:
-

bq. Robert, we can get away with 4 bytes per doc assuming we are not storing 
2GB of data per doc

I do not understand; I think the fdx index is the raw offset into fdt for a given 
doc, and must remain a long if you have more than 2GB total across all docs.





[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830627#action_12830627
 ] 

John Wang commented on LUCENE-2252:
---

bq. I do not understand; I think the fdx index is the raw offset into fdt for a 
given doc, and must remain a long if you have more than 2GB total across all 
docs.

As stated earlier, assuming we are not storing 2GB of data per doc, you don't 
need to keep a long per doc. There are many ways of representing this without 
paying much of a performance penalty. Off the top of my head, this would work:

Since offsets are always positive, you can use the sign bit to flag that 
Integer.MAX_VALUE has been reached; if it is set, add Integer.MAX_VALUE to the 
masked bits. You get away with an int per doc.
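
A rough sketch of what I mean (illustrative only; the class and method names are 
made up, no patch exists for this):

{code:java}
// Offsets below Integer.MAX_VALUE are stored directly; larger offsets set the
// sign bit and store the remainder, so 4 bytes per doc cover offsets up to
// about 2 * Integer.MAX_VALUE (~4 GB of .fdt data in total).
public final class OffsetCodec {
  static int encode(long offset) {
    if (offset < Integer.MAX_VALUE) {
      return (int) offset;
    }
    return (int) (offset - Integer.MAX_VALUE) | 0x80000000;
  }

  static long decode(int encoded) {
    if (encoded >= 0) {
      return encoded;
    }
    return (encoded & 0x7FFFFFFF) + (long) Integer.MAX_VALUE;
  }
}
{code}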

I am sure there are tons of other neat tricks for this that the Mikes or Yonik 
can come up with :)

bq. John, do you have a specific use case where this is the bottleneck, or are 
you just looking for places to optimize in general?

Hi Yonik, I understand this may not be a common use case. I am trying to use 
Lucene as a store solution, e.g. supporting just get()/put() operations as a 
content store. We wrote something simple in house and I compared it against 
Lucene, and the difference was dramatic. After profiling, it just seems this is 
an area with lots of room for improvement (profile posted earlier).

Reasons:
1) Our current setup is that the content is stored outside of the search 
cluster. Being able to fetch the data for rendering/highlighting within our 
search cluster would be good.
2) If the index contains the original data, changing the indexing schema, e.g. 
reindexing, can be done within each partition/node. Getting data from our 
authoritative datastore is expensive.

Perhaps LUCENE-1912 is the right way to go rather than "fixing" stored fields. 
If you also agree, I can just dup this over.

Thanks

-John





[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830628#action_12830628
 ] 

John Wang commented on LUCENE-2252:
---

Sorry, I meant LUCENE-1914




[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830632#action_12830632
 ] 

Robert Muir commented on LUCENE-2252:
-

bq. As stated earlier, assuming we are not storing 2GB of data per doc, you 
don't need to keep a long per doc.

Right, you stated this, but even if your 'store a long in an int' trick works, I 
still think 4 bytes/doc is too much (it's too much wasted RAM for virtually no 
gain).

I don't understand why you need something like a custom segment file to do this; 
why can't you just use a Directory to load this particular file into memory 
for your use case?





[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830641#action_12830641
 ] 

John Wang commented on LUCENE-2252:
---

bq. I still think 4 bytes/doc is too much (it's too much wasted RAM for 
virtually no gain)

That depends on the application. On modern machines (at least the machines we 
are using, e.g. a MacBook Pro) we can afford it :) I am not sure I agree with 
"virtually no gain" if you look at the numbers I posted. IMHO, the gain is 
significant.

I hate to get into a subjective argument on this, though.

bq. I don't understand why you need something like a custom segment file to do 
this; why can't you just use a Directory to load this particular file into 
memory for your use case?

Having a custom segment allows me to avoid this subjective argument about what 
is too much memory or what the gain is, since it just depends on my application, 
right?

Furthermore, with the question at hand, even if we do use the Directory 
implementation Uwe suggested, it is not optimal. For my use case, the cost of 
the seek/read for the count on the data file is very wasteful. Also, even for 
getting the offset, a random access into an array is cheaper than an in-memory 
seek/read/parse.

The very simple store mechanism we have written outside of Lucene has a gain of 
>85x, yes, 8500%, over Lucene stored fields. We would, however, like to take 
advantage of some of the good stuff already in Lucene, e.g. the merge mechanism 
(which is very nicely done), delete handling, etc.




[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830642#action_12830642
 ] 

Robert Muir commented on LUCENE-2252:
-

bq. On modern machines (at least the machines we are using, e.g. a MacBook Pro)

It's not really subjective, or about modern machines: you are talking about 
5M documents, but some indexes have a lot more documents, and 4 bytes/doc in RAM 
adds up to a lot!
For the case of using Lucene as a search engine library, this memory could be 
better spent on other things.
I don't think this is subjective, because it's a search engine library, not a 
document store.

bq. Furthermore, with the question at hand, even if we do use the Directory 
implementation Uwe suggested, it is not optimal.

But it is easy, and it takes away your disk seek. The "in-memory seek, 
read/parse" is, as you say, peanuts in comparison.

