[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526535#comment-13526535
 ] 

Robert Muir commented on LUCENE-4599:
-

{quote}
If you have ideas to efficiently compress term vectors, you're welcome!
{quote}

I think we waste space with the terms, especially prefix/suffix lengths (so 
much so that the prefix encoding probably hurts in general for many people). 
These should likely be bulk-compressed. As you already noticed in the patch, 
frequencies are a waste too. 

Flags are wasteful and stupid, but it seems like you already tried to address 
that to some extent. If we compress chunks of docs we should optimize the case 
where flags are the same. It's crazy that someone would have just positions for 
"body field" of document 2, but positions and offsets for "body field" of 
document 3. 
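
One way to read the suggestion about prefix/suffix lengths: rather than interleaving each term's lengths with its suffix bytes, collect the lengths for a whole chunk into separate int arrays that a bulk integer encoder can then process. The sketch below only illustrates that split. The class name and layout are hypothetical and not taken from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch: split sorted terms into bulk arrays of prefix/suffix lengths. */
class BulkTermLengths {
    final int[] prefixLengths; // bytes shared with the previous term
    final int[] suffixLengths; // bytes that differ from the previous term
    final byte[] suffixBytes;  // all suffixes concatenated, ready for a general-purpose compressor

    BulkTermLengths(String[] sortedTerms) {
        prefixLengths = new int[sortedTerms.length];
        suffixLengths = new int[sortedTerms.length];
        ByteArrayOutputStream suffixes = new ByteArrayOutputStream();
        byte[] previous = new byte[0];
        for (int i = 0; i < sortedTerms.length; i++) {
            byte[] term = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            int limit = Math.min(previous.length, term.length);
            int shared = 0;
            while (shared < limit && previous[shared] == term[shared]) {
                shared++;
            }
            prefixLengths[i] = shared;
            suffixLengths[i] = term.length - shared;
            suffixes.write(term, shared, term.length - shared);
            previous = term;
        }
        suffixBytes = suffixes.toByteArray();
    }

    public static void main(String[] args) {
        BulkTermLengths b = new BulkTermLengths(new String[] {"term", "terms", "test", "vector"});
        System.out.println(Arrays.toString(b.prefixLengths));
        System.out.println(Arrays.toString(b.suffixLengths));
        System.out.println(new String(b.suffixBytes, StandardCharsets.UTF_8));
    }
}
```

The two int arrays hold small, similar values, which is the kind of input packed or bulk integer encodings tend to handle well.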


> Compressed term vectors
> ---
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/termvectors
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-07 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526557#comment-13526557
 ] 

Adrien Grand commented on LUCENE-4599:
--

bq. I think we waste space with the terms, especially prefix/suffix lengths 
[..] these should likely be bulk-compressed

Good point.

bq. flags are wasteful and stupid, but it seems like you already tried to 
address that to some extent

I'm storing them in a packed ints array where each entry is 3 bits per value. 
I'll try to optimize when a field always has the same flags.
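
As a rough illustration of the idea (not the patch's code, and not Lucene's PackedInts API), 3-bit flags can be packed into longs, and a chunk can first be checked for a single shared flag value. The layout here is a simplification and all names are hypothetical.

```java
/** Hypothetical sketch: 3-bit flags per field, with a shortcut when all entries agree. */
class PackedFlags {
    static final int BITS = 3;                 // e.g. positions, offsets, payloads
    static final int PER_LONG = 64 / BITS;     // 21 values per long; the top bit stays unused

    /** Packs 3-bit values into longs, 21 values per long. */
    static long[] pack(int[] flags) {
        long[] packed = new long[(flags.length + PER_LONG - 1) / PER_LONG];
        for (int i = 0; i < flags.length; i++) {
            packed[i / PER_LONG] |= ((long) (flags[i] & 7)) << ((i % PER_LONG) * BITS);
        }
        return packed;
    }

    /** Reads back the 3-bit value stored at the given index. */
    static int get(long[] packed, int index) {
        return (int) ((packed[index / PER_LONG] >>> ((index % PER_LONG) * BITS)) & 7);
    }

    /** Returns the shared flag value, or -1 if entries differ or the array is empty. */
    static int commonFlags(int[] flags) {
        if (flags.length == 0) {
            return -1;
        }
        for (int f : flags) {
            if (f != flags[0]) {
                return -1;
            }
        }
        return flags[0];
    }

    public static void main(String[] args) {
        int[] flags = {1, 1, 3, 1};
        long[] packed = pack(flags);
        System.out.println(get(packed, 2));
        System.out.println(commonFlags(flags));
    }
}
```

When `commonFlags` returns a value, a writer could store that single value for the chunk instead of one entry per field.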






[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-07 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527075#comment-13527075
 ] 

David Smiley commented on LUCENE-4599:
--

Does it make sense to put this in an FST where the key is the term bytes and 
the value is what you're doing now for the positions, offsets, and payloads in 
a byte array?  The point to this is that a term dictionary is going to use much 
less space with sharing of prefixes and suffixes of words.

Or... can we simply reference the terms by ord (an int) instead of writing each 
term bytes?




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527167#comment-13527167
 ] 

Adrien Grand commented on LUCENE-4599:
--

I think a FST would not compress as much as what LZ4 or Deflate can do? But 
maybe it could speed up TermsEnum.seekCeil on large documents so it might be an 
interesting idea regarding random access speed?

bq. can we simply reference the terms by ord (an int) instead of writing each 
term bytes?

Do you mean their ords in the terms dictionary? Is that information available 
somewhere when writing/merging term vectors?




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527180#comment-13527180
 ] 

Michael McCandless commented on LUCENE-4599:


bq. Does it make sense to put this in an FST where the key is the term bytes 
and the value is what you're doing now for the positions, offsets, and payloads 
in a byte array? 

That's a neat idea :)  We should [almost] just be able to use 
MemoryPostingsFormat, since it already stores all postings in an FST.

bq. I think a FST would not compress as much as what LZ4 or Deflate can do? But 
maybe it could speed up TermsEnum.seekCeil on large documents so it might be an 
interesting idea regarding random access speed?

Likely it would not compress as well, since LZ4/Deflate are able to share 
common infix fragments too, but FST only shares prefix/suffix.  It'd be 
interesting to test ... but we should explore this (FST-backed 
TermVectorsFormat) in a new issue I think ... this issue seems awesome enough 
already :)

bq. Or... can we simply reference the terms by ord (an int) instead of writing 
each term bytes?

Using ords matching the main terms dict is a neat idea too!  It would be much 
more compact ... but, when reading the term vectors we'd need to resolve-by-ord 
against the main terms dictionary (not all postings formats support that: it's 
optional, and eg our default PF doesn't), which would likely be slower than 
today.

bq. Is that information available somewhere when writing/merging term vectors?

Unfortunately, no.  We only assign ords when it's time to flush the segment ... 
but we write term vectors "live" as we index each document.  If we changed 
that, eg buffered up term vectors, then we could get the ords when we wrote 
them.
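
To make the Deflate side of this comparison concrete, here is a minimal round trip over a chunk of term bytes using the JDK's `java.util.zip` classes. It is only a sketch of general-purpose chunk compression. The class name is hypothetical and it says nothing about how the patch lays out its chunks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch: compress a chunk of term bytes with Deflate and read it back. */
class TermChunkDeflate {
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[originalLength];
        int offset = 0;
        try {
            while (offset < originalLength && !inflater.finished()) {
                int n = inflater.inflate(result, offset, originalLength - offset);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input; stop instead of spinning
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        return result;
    }

    public static void main(String[] args) {
        StringBuilder terms = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            terms.append("compression decompression recompression precompressed ");
        }
        byte[] raw = terms.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " -> " + packed.length + " bytes");
    }
}
```

The sample terms share the fragment "compress" in the middle of each word, which is the kind of infix repetition an LZ77-style compressor can exploit and a prefix/suffix-sharing structure cannot.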




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527203#comment-13527203
 ] 

David Smiley commented on LUCENE-4599:
--

The ord reference approach seems most interesting to me, even if it's not 
workable at the moment (based on Mike's comment).  If things were changed to 
make ord's possible then there wouldn't even need to be any term information in 
term-vectors whatsoever; right?  Not even the ord (integer) itself because the 
array of each term vector is intrinsically in ord-order and aligned exactly to 
each ord; right?

Does anyone know roughly what % of term-vector storage is currently for the 
term?




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527215#comment-13527215
 ] 

Robert Muir commented on LUCENE-4599:
-

I'm not sure how much more compact ords would really be. Start thinking about 
average word length, shared prefixes and so on, and long references (even 
though they could be delta-encoded since they are in order, I still imagine 3 
or 4 bytes on average if you assume a large terms dict) don't seem to save a 
lot.

I think it's way more important to bulk encode the prefix/suffix lengths.
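
The byte estimate can be checked with a small calculation. The sketch below assumes a variable-length integer format carrying 7 payload bits per byte and ords stored as deltas. The figures in `main` (10 million terms in the dictionary, 500 unique terms in a document) are made-up illustrative numbers, not measurements from this thread.

```java
/** Hypothetical sketch: cost of delta-encoded term ords in a 7-bits-per-byte variable-length format. */
class OrdDeltaCost {
    /** Bytes needed to store a non-negative value with 7 payload bits per byte. */
    static int vIntLength(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total bytes to store ascending ords as a first value followed by deltas. */
    static long deltaEncodedBytes(long[] sortedOrds) {
        long total = 0;
        long previous = 0;
        for (long ord : sortedOrds) {
            total += vIntLength(ord - previous);
            previous = ord;
        }
        return total;
    }

    public static void main(String[] args) {
        long dictionarySize = 10_000_000L; // assumed size of the main terms dict
        long uniqueTermsInDoc = 500L;      // assumed number of distinct terms in one document
        long averageDelta = dictionarySize / uniqueTermsInDoc;
        System.out.println("average delta " + averageDelta + " needs "
            + vIntLength(averageDelta) + " bytes");
    }
}
```

Under those assumptions the average delta is 20000, which falls between 2^14 and 2^21 and so takes 3 bytes, in line with the 3 or 4 bytes estimated above.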




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-18 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535030#comment-13535030
 ] 

Shawn Heisey commented on LUCENE-4599:
--

With the 4.1 release triage likely coming soon, I am wondering if this is ready 
to make the cut or if it needs more work.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-18 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535079#comment-13535079
 ] 

Adrien Grand commented on LUCENE-4599:
--

Hey Shawn, I'm still working actively on this issue. I made good progress 
regarding compression ratio but term vectors are more complicated than stored 
fields (with lots of corner cases like negative start offsets, negative 
lengths, fields that don't always have the same options, etc.) so I will need 
time and lots of Jenkins builds to feel comfortable making it the default term 
vectors impl. It will depend on the 4.1 release schedule but given that it's 
likely to come rather soon and that I will have very little time to work on 
this issue until next month it will probably only make it to 4.2.
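
Negative offsets and lengths are awkward for an encoder because variable-length and packed integer formats usually assume non-negative values. One general technique for this, shown here as an illustration and not as what the patch does, is zigzag encoding, which maps small negative and positive values to small unsigned ones. The class name is hypothetical.

```java
/** Hypothetical sketch: zigzag encoding so that small negative ints stay small when stored. */
class ZigZag {
    /** Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... */
    static int encode(int value) {
        return (value << 1) ^ (value >> 31);
    }

    /** Inverse of encode. */
    static int decode(int encoded) {
        return (encoded >>> 1) ^ -(encoded & 1);
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2};
        for (int v : samples) {
            System.out.println(v + " -> " + encode(v) + " -> " + decode(encode(v)));
        }
    }
}
```

The cost is one extra bit per value, which is why an encoder might prefer to apply it only where negative values can actually occur.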




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-18 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535126#comment-13535126
 ] 

Shawn Heisey commented on LUCENE-4599:
--

bq. it will probably only make it to 4.2.

I'm not surprised.  I had hoped it would make it, but there will be enough to 
do for release without working on half-baked features.  I might need to 
continue to use Solr from branch_4x even after 4.1 gets released.

Thank you for everything you've done for me personally and the entire project.





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2012-12-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535295#comment-13535295
 ] 

Robert Muir commented on LUCENE-4599:
-

{quote}
but term vectors are more complicated than stored fields (with lots of corner 
cases like negative start offsets, negative lengths, fields that don't always 
have the same options, etc.)
{quote}

And all of these corner cases are completely bogus with no real use cases. We 
definitely need to make the long-term investment to fix this. It's so sad this 
kinda nonsense bullshit is slowing down Adrien here. It's hard to fix... I know 
I've wasted a lot of brain cycles on trying to come up with perfect solutions. 
But we have to make some progress somehow.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-18 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557849#comment-13557849
 ] 

Shawn Heisey commented on LUCENE-4599:
--

bq. If someone with very large term vector files wanted to test this new 
format, this would be great! I'll try on my side to perform more 
indexing/highlighting benchmarks.

My indexes are pretty big, with termvectors taking up a lot of that. The 3.5.0 
version of each of my shards is about 21 GB. The same index in 4.1 with 
compressed stored fields is a little less than 17 GB. I will give this patch a 
try on branch_4x. The full import will take 7-8 hours.

> Compressed term vectors
> ---
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/termvectors
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.2
>
> Attachments: LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-18 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557889#comment-13557889
 ] 

Shawn Heisey commented on LUCENE-4599:
--

I should ask - will this be on by default in Solr with the patch?  I just got 
the patch applied to 4.1 because I already had it, decided to try it before 
branch_4x.  It has occurred to me that as a LUCENE issue, it might not be 
turned on for Solr.





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558057#comment-13558057
 ] 

Shawn Heisey commented on LUCENE-4599:
--

In the Compressing41Codec class from the solr patch, there is a protected 
member.  I had to make that public or hundreds of tests were failing.  I don't 
know if that was the right thing to do, but it allowed the tests to proceed 
without immediate failures.  If they all pass after that change, I'll do 
another full-import.

{noformat}
[junit4:junit4]> Throwable #1: java.util.ServiceConfigurationError: Cannot 
instantiate SPI class: org.apache.solr.core.Compressing41Codec
[junit4:junit4]> Caused by: java.lang.IllegalAccessException: Class 
org.apache.lucene.util.NamedSPILoader can not access a member of class 
org.apache.solr.core.Compressing41Codec with modifiers "protected"

[junit4:junit4]> Throwable #1: java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.lucene.codecs.Codec
[junit4:junit4]>at 
org.apache.lucene.util.TestRuleSetupAndRestoreClassEnv.before(TestRuleSetupAndRestoreClassEnv.java:137)
{noformat}


> Compressed term vectors
> ---
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/termvectors
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.2
>
> Attachments: LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch, 
> solr.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558063#comment-13558063
 ] 

Uwe Schindler commented on LUCENE-4599:
---

Yes, the constructor must be public :-)
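
The requirement can be reproduced with plain JDK reflection. The sketch below is not Lucene's NamedSPILoader and the class names are hypothetical. It uses `Class.getConstructor()`, which only finds public constructors, so the failure mode differs slightly from the IllegalAccessException in the log above, but the cause is the same: a loader that instantiates classes reflectively needs a public no-arg constructor.

```java
import java.lang.reflect.Constructor;

/** Hypothetical sketch: reflective instantiation needs a public no-arg constructor. */
class SpiConstructorDemo {
    public static class PublicCodec {
        public PublicCodec() {}
    }

    public static class ProtectedCodec {
        protected ProtectedCodec() {}
    }

    /** Returns true if the class has a public no-arg constructor that can be invoked. */
    static boolean canInstantiate(Class<?> clazz) {
        try {
            Constructor<?> ctor = clazz.getConstructor(); // only sees public constructors
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("public constructor:    " + canInstantiate(PublicCodec.class));
        System.out.println("protected constructor: " + canInstantiate(ProtectedCodec.class));
    }
}
```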




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558079#comment-13558079
 ] 

Shawn Heisey commented on LUCENE-4599:
--

Another test run produced another dataimport failure.  I went ahead and put the 
new version into place and updated my solrconfig.xml in the same way that the 
patch updated the example; I have now begun a full-import.  I'm not sure the 
new format took - Solr didn't complain about the format of my existing indexes 
like I expected.


> Compressed term vectors
> ---
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/termvectors
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.2
>
> Attachments: 4599-dataimport-fail.log, 4599-zookeer-fail.log, 
> LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch, solr.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558089#comment-13558089
 ] 

Uwe Schindler commented on LUCENE-4599:
---

bq. Another test run produced another dataimport failure.

Could be the DST change in Fiji today, see mailing list. I disabled the test.

bq. I'm not sure the new format took - Solr didn't complain about the format of 
my existing indexes like I expected.

The CodecFactory in the patch produces a new index format, which is marked by 
another header ("Compressing41"). An existing index has the codec id "Lucene41" 
and is still readable (Lucene will use the default codec to read it).
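
The behaviour described here amounts to a lookup by name: each segment records which codec wrote it, and the reader resolves that name when opening the segment, so changing the codec used for new segments does not invalidate old ones. A toy sketch of that dispatch follows. The class and the description strings are hypothetical, and this is not Lucene's actual Codec API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: segments name the codec that wrote them; formats are looked up by name. */
class CodecLookupDemo {
    private final Map<String, String> formatsByName = new HashMap<>();

    void register(String codecName, String formatDescription) {
        formatsByName.put(codecName, formatDescription);
    }

    /** Resolves the format for a segment from the codec name stored with that segment. */
    String formatFor(String segmentCodecName) {
        String format = formatsByName.get(segmentCodecName);
        if (format == null) {
            throw new IllegalArgumentException("No codec registered under name: " + segmentCodecName);
        }
        return format;
    }

    public static void main(String[] args) {
        CodecLookupDemo registry = new CodecLookupDemo();
        registry.register("Lucene41", "format of the existing segments");
        registry.register("Compressing41", "format of segments written with the patch");
        // An old segment and a new segment can sit side by side in the same index.
        System.out.println(registry.formatFor("Lucene41"));
        System.out.println(registry.formatFor("Compressing41"));
    }
}
```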




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558140#comment-13558140
 ] 

Adrien Grand commented on LUCENE-4599:
--

I tried to reproduce the Solr cloud error (ant test 
-Dtestcase=ClusterStateUpdateTest -Dtests.method=testCoreRegistration 
-Dtests.seed=F72E8E946F6EBEAF -Dtests.nightly=true -Dtests.weekly=true 
-Dtests.slow=true -Dtests.locale=sr_RS -Dtests.timezone=Africa/Bamako 
-Dtests.file.encoding=UTF-8) but it succeeded, so I assume it's a random Solr 
cloud failure due to the fact that your machine was very busy?




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558148#comment-13558148
 ] 

Shawn Heisey commented on LUCENE-4599:
--

bq. so I assume it's a random Solr cloud failure due to the fact that your 
machine was very busy?

That's my assumption too.  My worry is that a real SolrCloud might fail in a 
similar way if the machines involved become really busy.  If the test is 
designed to have much tighter constraints than a typical production cloud, then 
it might not be something I have to worry about.





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558157#comment-13558157
 ] 

Shawn Heisey commented on LUCENE-4599:
--

My full import just finished.  The indexes built with a patched Solr from 
lucene_solr_4_1 are pretty much the same size as they were before.  The 
indexes below that end in 0 are the recently built cores that are now live.  
The ones that end in 1 are the ones built with an unmodified 4.1.

{noformat}
ncindex@bigindy5 /index/solr4/data $ du -sc *
762820      inc_0
742984      inc_1
24          ncmain
17210772    s0_0
17214504    s0_1
17211784    s1_0
17143900    s1_1
17191632    s2_0
17190108    s2_1
17192292    s3_0
17188164    s3_1
17198920    s4_0
1728        s4_1
17205092    s5_0
17205800    s5_1
207858804   total
{noformat}





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-19 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558164#comment-13558164
 ] 

Shawn Heisey commented on LUCENE-4599:
--

You can ignore the previous comment.  I thought I had added the codec line to 
solrconfig.xml, but it turns out that I edited the solrconfig from the 3.5.0 
directory, not the 4.1 directory.  On my dev server, Solr 3.5 isn't even 
running.  Now that I've changed the right config, a new full-import appears to 
be working: there are now two termvector files per segment instead of three.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-20 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558245#comment-13558245
 ] 

Shawn Heisey commented on LUCENE-4599:
--

The new term vector (TV) files are 53.2% of the old size.
New TV files total 3890809739 bytes.
Old TV files total 7311612548 bytes.

{noformat}
Unmodified Solr 4.1:
total 17154140
drwxr-xr-x 2 ncindex ncindex      45056 Jan 19 21:35 ./
drwxr-xr-x 4 ncindex ncindex       4096 Jan 18 20:15 ../
-rw-r--r-- 1 ncindex ncindex         99 Jan 19 21:28 segments_dt
-rw-r--r-- 1 ncindex ncindex         20 Jan 19 21:28 segments.gen
-rw-r--r-- 1 ncindex ncindex 3220362314 Jan 19 21:11 _uk.fdt
-rw-r--r-- 1 ncindex ncindex    1796091 Jan 19 21:11 _uk.fdx
-rw-r--r-- 1 ncindex ncindex       3291 Jan 19 21:28 _uk.fnm
-rw-r--r-- 1 ncindex ncindex 2712855241 Jan 19 21:23 _uk_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 2641242950 Jan 19 21:23 _uk_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1605874308 Jan 19 21:23 _uk_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   35091811 Jan 19 21:23 _uk_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex        115 Jan 19 21:28 _uk_nrm.cfe
-rw-r--r-- 1 ncindex ncindex   36874222 Jan 19 21:28 _uk_nrm.cfs
-rw-r--r-- 1 ncindex ncindex        473 Jan 19 21:28 _uk.si
-rw-r--r-- 1 ncindex ncindex   24581897 Jan 19 21:28 _uk.tvd
-rw-r--r-- 1 ncindex ncindex 7090368538 Jan 19 21:28 _uk.tvf
-rw-r--r-- 1 ncindex ncindex  196662113 Jan 19 21:28 _uk.tvx

Solr 4.1 with patch:
total 13812100
drwxr-xr-x 2 ncindex ncindex      53248 Jan 20 06:10 ./
drwxr-xr-x 4 ncindex ncindex       4096 Jan 18 20:15 ../
-rw-r--r-- 1 ncindex ncindex 3220492130 Jan 20 05:54 _1oy.fdt
-rw-r--r-- 1 ncindex ncindex    1790533 Jan 20 05:54 _1oy.fdx
-rw-r--r-- 1 ncindex ncindex       3291 Jan 20 06:10 _1oy.fnm
-rw-r--r-- 1 ncindex ncindex 2713448546 Jan 20 06:08 _1oy_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 2640844965 Jan 20 06:08 _1oy_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1604289094 Jan 20 06:08 _1oy_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   34910618 Jan 20 06:08 _1oy_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex        115 Jan 20 06:10 _1oy_nrm.cfe
-rw-r--r-- 1 ncindex ncindex   36874183 Jan 20 06:10 _1oy_nrm.cfs
-rw-r--r-- 1 ncindex ncindex        477 Jan 20 06:10 _1oy.si
-rw-r--r-- 1 ncindex ncindex 3889805695 Jan 20 06:10 _1oy.tvd
-rw-r--r-- 1 ncindex ncindex    1004044 Jan 20 06:10 _1oy.tvx
-rw-r--r-- 1 ncindex ncindex         20 Jan 20 06:10 segments.gen
-rw-r--r-- 1 ncindex ncindex        105 Jan 20 06:10 segments_ul
-rw-r--r-- 1 ncindex ncindex          0 Jan 19 21:39 write.lock
{noformat}
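
The term vector totals and the 53.2% figure quoted at the top of this comment 
can be re-derived from the two listings. The snippet below is only a sanity 
check: the byte counts are copied from the listings, while the variable names 
are made up for illustration.

```python
# Term vector file sizes in bytes, copied from the two directory listings.
# Unmodified Solr 4.1 writes three TV files per segment (.tvd, .tvf, .tvx).
old_tv = {"_uk.tvd": 24581897, "_uk.tvf": 7090368538, "_uk.tvx": 196662113}
# The patched build writes two TV files per segment (.tvd, .tvx).
new_tv = {"_1oy.tvd": 3889805695, "_1oy.tvx": 1004044}

old_total = sum(old_tv.values())
new_total = sum(new_tv.values())
ratio = new_total / old_total  # new size as a fraction of the old size

print(old_total)        # 7311612548
print(new_total)        # 3890809739
print(f"{ratio:.1%}")   # 53.2%
```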

For this listing, the _0 and _1 indexes have been swapped - now the _1 indexes 
are live.

{noformat}
ncindex@bigindy5 /index/solr4/data $ du -sc *
492         inc_0
609212      inc_1
24          ncmain
17154980    s0_0
13840212    s0_1
17211000    s1_0
13913260    s1_1
17191660    s2_0
13895536    s2_1
17192320    s3_0
13889920    s3_1
17198940    s4_0
13897380    s4_1
17205112    s5_0
13918936    s5_1
187118984   total
{noformat}
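
To put the du numbers in perspective, the per-shard effect on total index size 
can be computed as below. This is a rough sketch that assumes the _1 
directories are the ones built with the patch (as the sizes suggest) and that 
du is reporting its usual 1K blocks; the variable names are illustrative only.

```python
# (size of _0 directory, size of _1 directory) per shard, copied from the
# du listing. Assumption: _0 = without compressed term vectors, _1 = with.
shards = {
    "s0": (17154980, 13840212),
    "s1": (17211000, 13913260),
    "s2": (17191660, 13895536),
    "s3": (17192320, 13889920),
    "s4": (17198940, 13897380),
    "s5": (17205112, 13918936),
}

for name, (old, new) in shards.items():
    # Whole-shard size with the patch, relative to the size without it.
    print(f"{name}: {new / old:.1%} of the previous size")

total_old = sum(old for old, _ in shards.values())
total_new = sum(new for _, new in shards.values())
print(f"all shards: {total_new / total_old:.1%} of the previous size")
```

Each shard ends up at roughly 81% of its previous size. The reduction for the 
whole index is smaller than for the term vector files alone because the other 
index files are essentially unchanged.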





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-20 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558401#comment-13558401
 ] 

Adrien Grand commented on LUCENE-4599:
--

OK, I think I understood: I had forgotten to turn debug off, and although 
documents in this collection are rather big, queries tend to favor small docs, 
whose chunks contain more documents (up to 30). I ran the benchmark again with 
a very small chunk size (128) so that chunks would likely contain a single doc 
and results got better:
{noformat}
          Fuzzy2    94.39   (7.8%)    88.33   (7.5%)    -6.4% ( -20% -    9%)
         MedTerm   292.09   (2.7%)   279.01   (2.6%)    -4.5% (  -9% -    0%)
      OrHighHigh    76.84   (7.4%)    73.58   (5.8%)    -4.2% ( -16% -    9%)
          Fuzzy1    93.07   (4.8%)    89.59   (4.4%)    -3.7% ( -12% -    5%)
       OrHighMed    69.23   (6.4%)    67.17   (4.9%)    -3.0% ( -13% -    8%)
      HighPhrase     8.54   (9.4%)     8.36  (11.6%)    -2.1% ( -21% -   20%)
       LowPhrase   125.02   (2.5%)   122.91   (3.4%)    -1.7% (  -7% -    4%)
       MedPhrase    39.97   (5.3%)    39.58   (7.6%)    -1.0% ( -13% -   12%)
        HighTerm   177.70   (2.4%)   176.21   (2.2%)    -0.8% (  -5% -    3%)
         LowTerm   370.26   (3.7%)   367.36   (2.8%)    -0.8% (  -7% -    5%)
       OrHighLow   106.08   (5.2%)   105.41   (4.7%)    -0.6% ( -10% -    9%)
 LowSloppyPhrase    71.29   (5.2%)    70.95   (5.3%)    -0.5% ( -10% -   10%)
HighSloppyPhrase    30.52   (5.6%)    30.39   (5.2%)    -0.4% ( -10% -   10%)
        PKLookup   339.12   (3.0%)   338.09   (3.1%)    -0.3% (  -6% -    5%)
 MedSloppyPhrase    71.13   (4.2%)    70.95   (4.4%)    -0.3% (  -8% -    8%)
      AndHighLow   259.19   (3.8%)   258.54   (5.1%)    -0.2% (  -8% -    8%)
         Respell    69.04   (3.7%)    68.92   (3.2%)    -0.2% (  -6% -    6%)
     AndHighHigh    74.49   (1.5%)    74.47   (1.8%)    -0.0% (  -3% -    3%)
        Wildcard   157.16   (2.0%)   157.21   (1.9%)     0.0% (  -3% -    3%)
      AndHighMed    79.81   (2.1%)    80.16   (1.6%)     0.4% (  -3% -    4%)
     MedSpanNear    14.09   (3.6%)    14.16   (4.4%)     0.5% (  -7% -    8%)
         Prefix3   281.17   (2.7%)   282.85   (2.5%)     0.6% (  -4% -    5%)
    HighSpanNear     7.73   (3.9%)     7.79   (2.8%)     0.8% (  -5% -    7%)
          IntNRQ   143.14   (3.0%)   144.45   (3.2%)     0.9% (  -5% -    7%)
     LowSpanNear    23.85   (6.6%)    24.36   (6.0%)     2.2% (  -9% -   15%)
{noformat}

(Decreasing the chunk size from 16KB to 128 bytes made the compression ratio 
increase from 66% to 68%, i.e. compression got slightly worse.)
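
The relationship between chunk size and the number of documents sharing a 
chunk, described at the top of this comment, can be illustrated with a toy 
model. This is only a sketch: it assumes a greedy policy in which a chunk is 
closed as soon as the buffered bytes reach the chunk size, which may differ 
from what the patch actually does, and the 600-byte document size is a 
made-up value.

```python
def chunk_documents(doc_sizes, chunk_size):
    """Group consecutive documents into chunks.

    Assumed policy (not taken from the patch): keep buffering documents and
    close the chunk once the buffered bytes reach chunk_size.
    """
    chunks, current, buffered = [], [], 0
    for size in doc_sizes:
        current.append(size)
        buffered += size
        if buffered >= chunk_size:
            chunks.append(current)
            current, buffered = [], 0
    if current:  # flush the trailing, partially filled chunk
        chunks.append(current)
    return chunks


# Hypothetical collection: 60 small docs with 600 bytes of term vectors each.
doc_sizes = [600] * 60

big_chunks = chunk_documents(doc_sizes, 16 * 1024)  # 16KB chunk size
small_chunks = chunk_documents(doc_sizes, 128)      # 128-byte chunk size

print(max(len(c) for c in big_chunks))    # 28 docs share the largest chunk
print(max(len(c) for c in small_chunks))  # 1 doc per chunk
```

Under this model, reading the term vectors of one small document with a 16KB 
chunk size means decompressing data for a couple of dozen neighbouring 
documents, whereas with a 128-byte chunk size each document sits alone in its 
chunk.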

> Compressed term vectors
> ---
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/termvectors
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.2
>
> Attachments: 4599-dataimport-fail.log, 4599-zookeer-fail.log, 
> CompressingTVF_ingest_rate.png, highlightNoStop.tasks, 
> Lucene40TVF_ingest_rate.png, LUCENE-4599.patch, LUCENE-4599.patch, 
> LUCENE-4599.patch, solr.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-20 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558418#comment-13558418
 ] 

Adrien Grand commented on LUCENE-4599:
--

I tried to compute the compression ratio of the term vector files compared to 
Lucene40TVF for small docs (the wikipedia 1K docs) based on the chunk size (the 
patch has 2^14 as a default chunk size):
|| Chunk size || no options || positions + offsets ||
| 2^7  | 0.79 | 0.68 |
| 2^8  | 0.79 | 0.68 |
| 2^9  | 0.75 | 0.66 |
| 2^10 | 0.73 | 0.65 |
| 2^11 | 0.70 | 0.63 |
| 2^12 | 0.68 | 0.62 |
| 2^13 | 0.65 | 0.60 |
| 2^14 | 0.63 | 0.59 |
| 2^15 | 0.62 | 0.58 |
| 2^16 | 0.62 | 0.59 |
| 2^17 | 0.62 | 0.58 |

Interestingly, raising the chunk size above 2^14 doesn't bring much. 2^11 or 
2^12 look like good candidates for the default size if we were to make this TVF 
the default one (making big documents likely to be alone in their chunks and 
preventing small docs from raising the compression ratio).
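
The diminishing returns above 2^14 can be read directly off the table. The 
snippet below is just a small worked check on those numbers; the ratios are 
copied from the table, while the helper name is made up for illustration.

```python
# Compression ratio (compressed size / Lucene40TVF size) per chunk size,
# keyed by the exponent n of the chunk size 2^n.
# Each value is (no options, positions + offsets), copied from the table.
ratios = {
    7: (0.79, 0.68), 8: (0.79, 0.68), 9: (0.75, 0.66), 10: (0.73, 0.65),
    11: (0.70, 0.63), 12: (0.68, 0.62), 13: (0.65, 0.60), 14: (0.63, 0.59),
    15: (0.62, 0.58), 16: (0.62, 0.59), 17: (0.62, 0.58),
}


def ratio_gain(small_exp, large_exp, column):
    """Drop in compression ratio when growing the chunk from 2^small_exp
    to 2^large_exp (column 0 = no options, 1 = positions + offsets)."""
    return round(ratios[small_exp][column] - ratios[large_exp][column], 2)


for column, label in enumerate(("no options", "positions + offsets")):
    print(label)
    print("  2^7  -> 2^14:", ratio_gain(7, 14, column))
    print("  2^14 -> 2^17:", ratio_gain(14, 17, column))
    print("  2^11 -> 2^14:", ratio_gain(11, 14, column))
```

Going from 2^7 to 2^14 lowers the ratio by 0.16 (no options) and 0.09 
(positions + offsets), while going from 2^14 to 2^17 only gains a further 0.01 
in both cases. Picking 2^11 instead of 2^14 gives up 0.07 and 0.04 
respectively.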






[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-20 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558435#comment-13558435
 ] 

Adrien Grand commented on LUCENE-4599:
--

If there's no objection, I plan to commit soon.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-21 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559058#comment-13559058
 ] 

Commit Tag Bot commented on LUCENE-4599:


[branch_4x commit] Adrien Grand
http://svn.apache.org/viewvc?view=revision&revision=1436584

LUCENE-4599: New compressed TVF impl: CompressingTermVectorsFormat (merged from 
r1436556).






[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-21 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559059#comment-13559059
 ] 

Commit Tag Bot commented on LUCENE-4599:


[trunk commit] Adrien Grand
http://svn.apache.org/viewvc?view=revision&revision=1436556

LUCENE-4599: New compressed TVF impl: CompressingTermVectorsFormat.






[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-21 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559432#comment-13559432
 ] 

Commit Tag Bot commented on LUCENE-4599:


[branch_4x commit] Robert Muir
http://svn.apache.org/viewvc?view=revision&revision=1436764

LUCENE-4599: fix Compressing vectors to not return a docsAndPositions when it 
has no prox





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-21 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559435#comment-13559435
 ] 

Commit Tag Bot commented on LUCENE-4599:


[trunk commit] Robert Muir
http://svn.apache.org/viewvc?view=revision&revision=1436765

LUCENE-4599: fix Compressing vectors to not return a docsAndPositions when it 
has no prox





[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559524#comment-13559524
 ] 

Markus Jelsma commented on LUCENE-4599:
---

Great reduction! Is this going to be enabled in the default Lucene 41 codec 
that already compresses stored fields?




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559527#comment-13559527
 ] 

Adrien Grand commented on LUCENE-4599:
--

Unfortunately it is too late for Lucene 4.1, and anyway this new format still 
requires a lot of testing, but I plan to propose making it the default term 
vectors format for Lucene 4.2. So yes, Lucene 4.2 might compress both stored 
fields and term vectors.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559528#comment-13559528
 ] 

Markus Jelsma commented on LUCENE-4599:
---

Alright. Do you already have an issue filed for making it the default in trunk 
so I can watch it? We use recent Solr/Lucene trunk checkouts and are interested 
in how this affects things.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559530#comment-13559530
 ] 

Adrien Grand commented on LUCENE-4599:
--

Not yet. I'm leaving some time for the Jenkins instances to find bugs (for 
example, one of them found a little bug last night that Robert had to fix) and 
for people to criticize/fix/improve the format.




[jira] [Commented] (LUCENE-4599) Compressed term vectors

2013-01-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559540#comment-13559540
 ] 

Markus Jelsma commented on LUCENE-4599:
---

Alright. Looking forward to that. Thanks Adrien!
