Re: Lucene 2.9 in Solr 1.4

2009-06-15 Thread Aleksander M. Stensby

Sounds good!
I'd be happy to help speed things up, so if there's anything I can do,  
please let me know!


Cheers,
 Aleksander

On Fri, 12 Jun 2009 17:42:04 +0200, Yonik Seeley  
yo...@lucidimagination.com wrote:



So it looks like the Lucene 2.9 release has all of a sudden accelerated,
and at this point it seems wiser to release Solr 1.4 with Lucene 2.9
(non-dev), assuming that goes as quickly as planned.  It feels like we're
a bit behind schedule here in Solr-land anyway, so it really doesn't seem
like it would slow our release up much.

-Yonik
http://www.lucidimagination.com





--
Aleksander M. Stensby
Lead software developer and system architect
Integrasco A/S
www.integrasco.no
http://twitter.com/Integrasco

Please consider the environment before printing all or any of this e-mail


[jira] Created: (SOLR-1219) use setproxy ant task when proxy properties are specified

2009-06-15 Thread Koji Sekiguchi (JIRA)
use setproxy ant task when proxy properties are specified
-

 Key: SOLR-1219
 URL: https://issues.apache.org/jira/browse/SOLR-1219
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Priority: Trivial


Currently, ant luke and ant example will fail if you use a proxy:

{code}
$ ant luke
build.xml:881: HTTP Authorization failure
{code}

To avoid this, use the setproxy ant task when proxy properties are specified by the user:

{code}
$ ant luke -Dproxy.host=hostname -Dproxy.port=8080 -Dproxy.user=user 
-Dproxy.password=passwd
{code}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1219) use setproxy ant task when proxy properties are specified

2009-06-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1219:
-

Attachment: SOLR-1219.patch

 use setproxy ant task when proxy properties are specified
 -

 Key: SOLR-1219
 URL: https://issues.apache.org/jira/browse/SOLR-1219
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Priority: Trivial
 Attachments: SOLR-1219.patch


 Currently, ant luke and ant example will fail if you use a proxy:
 {code}
 $ ant luke
 build.xml:881: HTTP Authorization failure
 {code}
 To avoid this, use the setproxy ant task when proxy properties are specified
 by the user:
 {code}
 $ ant luke -Dproxy.host=hostname -Dproxy.port=8080 -Dproxy.user=user 
 -Dproxy.password=passwd
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1203) We should add an example of setting the update.processor for a given RequestHandler

2009-06-15 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719460#action_12719460
 ] 

Noble Paul commented on SOLR-1203:
--

update.processor is not per-RequestHandler; it is common across all
request handlers.

 We should add an example of setting the update.processor for a given 
 RequestHandler
 ---

 Key: SOLR-1203
 URL: https://issues.apache.org/jira/browse/SOLR-1203
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 1.4


 a commented-out example that points to the commented-out example update chain

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: solr 1.4 lite ?

2009-06-15 Thread Ian Holsman

Is bandwidth or disk space really an issue for people today?

You should be focusing on decreasing the index size and improving indexing
speed, not the code-base. It's not like you guys have unlimited time to
spend on this project.


PS: if you don't have the examples in there, then people won't know those
features exist.


And yes, I regularly copy the example schema to create a new index. I
know it's bad practice, and not the most efficient schema, but it
usually has the cool features enabled in it ;-)


Noble Paul നോബിള്‍ नोब्ळ् wrote:

+1 for solr lite

A lot of users are fine without that example stuff (DIH, Cell)

One option is to have two different distributions: a solr.zip and a solr-min.zip

On Thu, Jun 11, 2009 at 6:59 AM, Matthew Runo mr...@zappos.com wrote:
  

I'd be willing to guess that the vast majority of users start off with the
example app and customize from there to meet their needs.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jun 10, 2009, at 6:04 PM, Eric Pugh wrote:



Has anyone really complained about the size of Solr?  One of the things I
like about Solr is how simple it is to get things up and running, and how
accessible the example directory makes everything.  When I first played with
DIH and Cell, everything was there.  I didn't have to chase down .jars from
multiple places.  Maybe if it was as simple as ant build-example-cell and
ant build-example-dih then there might not be a barrier to entry for new
users.

I'd be curious to hear what percentage of folks deploy solr based on the
example app, and how many just start out with the most stripped down
solr.war and build everything up from there?

Eric




On Jun 10, 2009, at 6:56 PM, Grant Ingersoll wrote:

  

+1

Should be easy enough to conjure up the Ant magic.

On Jun 10, 2009, at 5:55 PM, Yonik Seeley wrote:



Thanks for bringing this up Patrick... clearly it would be nice to
avoid (or mandate) 100MB downloads!

-Yonik
http://www.lucidimagination.com


On Wed, Jun 10, 2009 at 5:50 PM, patrick o'leary pj...@pjaol.com
wrote:
  

Just using the apache-solr example directory, it seems to have gotten a
bit big, e.g.:

$ du -sh *
13M     apache-solr-1.3.0
92M     apache-solr-1.4.0

The biggest space user being example-DIH

apache-solr-1.4.0/example
$ du -sh *
4.0K    README.txt
5.5M    clustering
80K     etc
*32M     example-DIH*
42K     exampleAnalysis
168K    exampledocs
13M     lib
52K     logs
118K    multicore
*31M     solr*
20K     start.jar
12M     webapps
12K     work

solr/lib is now 30MB
apache-solr-1.4.0/example/solr/lib
$ ls -lhS
total 30M
-rwx-- 1 pjaol None  14M Jun 10 17:08 ooxml-schemas-1.0.jar
-rwx-- 1 pjaol None 4.3M Jun 10 17:08 icu4j-3.8.jar
-rwx-- 1 pjaol None 3.2M Jun 10 17:08 pdfbox-0.7.3.jar
-rwx-- 1 pjaol None 2.6M Jun 10 17:08 xmlbeans-2.3.0.jar
-rwx-- 1 pjaol None 1.5M Jun 10 17:08 poi-3.5-beta5.jar
-rwx-- 1 pjaol None 1.2M Jun 10 17:08 xercesImpl-2.8.1.jar
-rwx-- 1 pjaol None 1.1M Jun 10 17:08 bcprov-jdk14-132.jar

as opposed to 0 for 1.3.0

This pushes Solr to over a 100MB download for features that I'm sure can be
packaged up separately, as they look seldom used.
It would make sense, if there's going to be a batteries-included version, to
also have a solr-lite version.

P



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search



-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal


Re: solr 1.4 lite ?

2009-06-15 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, Jun 15, 2009 at 2:32 PM, Ian Holsman li...@holsman.net wrote:
 Is bandwidth or disk space really an issue for people today?

 You should be focusing on decreasing the index size and improving indexing
 speed, not the code-base. It's not like you guys have unlimited time to spend
 on this project.

This does not have to be a code change. It just has to be a new build
target (even that is code, but...).


Downloading 100MB is not very pleasant if you keep downloading
nightlies every now and then.


 PS: if you don't have the examples in there, then people won't know those
 features exist.

 And yes, I regularly copy the example schema to create a new index. I know
 it's bad practice, and not the most efficient schema, but it usually has the
 cool features enabled in it ;-)

 Noble Paul നോബിള്‍ नोब्ळ् wrote:

 +1 for solr lite

 A lot of users are fine without that example stuff (DIH, Cell)

 One option is to have two different distributions: a solr.zip and a
 solr-min.zip

 On Thu, Jun 11, 2009 at 6:59 AM, Matthew Runo mr...@zappos.com wrote:


 I'd be willing to guess that the vast majority of users start off with the
 example app and customize from there to meet their needs.

 Thanks for your time!

 Matthew Runo
 Software Engineer, Zappos.com
 mr...@zappos.com - 702-943-7833

 On Jun 10, 2009, at 6:04 PM, Eric Pugh wrote:



 Has anyone really complained about the size of Solr?  One of the things I
 like about Solr is how simple it is to get things up and running, and how
 accessible the example directory makes everything.  When I first played with
 DIH and Cell, everything was there.  I didn't have to chase down .jars from
 multiple places.  Maybe if it was as simple as ant build-example-cell and
 ant build-example-dih then there might not be a barrier to entry for new
 users.

 I'd be curious to hear what percentage of folks deploy solr based on the
 example app, and how many just start out with the most stripped down
 solr.war and build everything up from there?

 Eric




 On Jun 10, 2009, at 6:56 PM, Grant Ingersoll wrote:



 +1

 Should be easy enough to conjure up the Ant magic.

 On Jun 10, 2009, at 5:55 PM, Yonik Seeley wrote:



 Thanks for bringing this up Patrick... clearly it would be nice to
 avoid (or mandate) 100MB downloads!

 -Yonik
 http://www.lucidimagination.com


 On Wed, Jun 10, 2009 at 5:50 PM, patrick o'leary pj...@pjaol.com
 wrote:


 Just using the apache-solr example directory, it seems to have gotten a
 bit big, e.g.:
 e.g.

 $ du -sh *
 13M     apache-solr-1.3.0
 92M     apache-solr-1.4.0

 The biggest space user being example-DIH

 apache-solr-1.4.0/example
 $ du -sh *
 4.0K    README.txt
 5.5M    clustering
 80K     etc
 *32M     example-DIH*
 42K     exampleAnalysis
 168K    exampledocs
 13M     lib
 52K     logs
 118K    multicore
 *31M     solr*
 20K     start.jar
 12M     webapps
 12K     work

 solr/lib is now 30MB
 apache-solr-1.4.0/example/solr/lib
 $ ls -lhS
 total 30M
 -rwx-- 1 pjaol None  14M Jun 10 17:08 ooxml-schemas-1.0.jar
 -rwx-- 1 pjaol None 4.3M Jun 10 17:08 icu4j-3.8.jar
 -rwx-- 1 pjaol None 3.2M Jun 10 17:08 pdfbox-0.7.3.jar
 -rwx-- 1 pjaol None 2.6M Jun 10 17:08 xmlbeans-2.3.0.jar
 -rwx-- 1 pjaol None 1.5M Jun 10 17:08 poi-3.5-beta5.jar
 -rwx-- 1 pjaol None 1.2M Jun 10 17:08 xercesImpl-2.8.1.jar
 -rwx-- 1 pjaol None 1.1M Jun 10 17:08 bcprov-jdk14-132.jar

 as opposed to 0 for 1.3.0

 This pushes Solr to over a 100MB download for features that I'm sure can be
 packaged up separately, as they look seldom used.
 It would make sense, if there's going to be a batteries-included version, to
 also have a solr-lite version.

 P



 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
 using
 Solr/Lucene:
 http://www.lucidimagination.com/search



 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
 http://www.opensourceconnections.com
 Free/Busy: http://tinyurl.com/eric-cal

-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


[jira] Assigned: (SOLR-1150) OutofMemoryError on enabling highlighting

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned SOLR-1150:
-

Assignee: Mark Miller

 OutofMemoryError on enabling highlighting
 -

 Key: SOLR-1150
 URL: https://issues.apache.org/jira/browse/SOLR-1150
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Siddharth Gargate
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: SOLR-1150.patch


 Please refer to the following mail thread:
 http://markmail.org/message/5nhkm5h3ongqlput
 I am testing with a 2 MB document size and just 500 documents. Indexing
 works fine even with a 128 MB heap, but on searching Solr throws an OOM
 error. This issue is observed only when highlighting is enabled. While
 indexing I store 1 MB of text; while searching, Solr reads all 500
 documents into memory, including the complete 1 MB stored field for each.
 As a result, 500 docs * 1 MB * 2 (2 bytes per char) = 1000 MB of memory
 is required for searching.
 This memory usage can be reduced by reading one document at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
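
A minimal, hedged sketch of the "one document at a time" idea from the
SOLR-1150 description above, using Lucene 2.9's lazy field loading. The
field name "content" and the reader/hit-list variables are illustrative
placeholders, not code from the attached patch:

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;

public class LazyHighlightSketch {
  // Load only the big stored field lazily; everything else eagerly.
  static final FieldSelector LAZY_CONTENT = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
      return "content".equals(fieldName)       // hypothetical field name
          ? FieldSelectorResult.LAZY_LOAD
          : FieldSelectorResult.LOAD;
    }
  };

  static void highlightHits(IndexReader reader, int[] hitDocIds) throws Exception {
    for (int docId : hitDocIds) {              // one document at a time
      Document doc = reader.document(docId, LAZY_CONTENT);
      Fieldable f = doc.getFieldable("content");
      String text = f.stringValue();           // field bytes are read only here
      // ... highlight 'text', then let it become garbage before the next doc
    }
  }
}
{code}

This keeps at most one large stored field in memory at a time, instead of
500 docs * 1 MB * 2 bytes all at once.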



Re: solr 1.4 lite ?

2009-06-15 Thread patrick o'leary
Those examples and features should still exist, but in a batteries-included
version.
The lite version is just for those who want the simple, basic search feature.

More than 70% of Solr these days is indexing strategy or client libraries,
rather than search.
These are important, but are now beginning to clutter things up for
deployment.
It would be good to offer a lite version for those that are just upgrading,
and a simple target to build a lite example for those hacking away on the
code or distributing an example version.


On Mon, Jun 15, 2009 at 5:02 AM, Ian Holsman li...@holsman.net wrote:

 Is bandwidth or disk space really an issue for people today?

 You should be focusing on decreasing the index size and improving indexing
 speed, not the code-base. It's not like you guys have unlimited time to spend
 on this project.

 PS: if you don't have the examples in there, then people won't know those
 features exist.

 And yes, I regularly copy the example schema to create a new index. I know
 it's bad practice, and not the most efficient schema, but it usually has the
 cool features enabled in it ;-)


 Noble Paul നോബിള്‍ नोब्ळ् wrote:

 +1 for solr lite

 A lot of users are fine without that example stuff (DIH, Cell)

 One option is to have two different distributions: a solr.zip and a
 solr-min.zip

 On Thu, Jun 11, 2009 at 6:59 AM, Matthew Runo mr...@zappos.com wrote:


 I'd be willing to guess that the vast majority of users start off with the
 example app and customize from there to meet their needs.

 Thanks for your time!

 Matthew Runo
 Software Engineer, Zappos.com
 mr...@zappos.com - 702-943-7833

 On Jun 10, 2009, at 6:04 PM, Eric Pugh wrote:



 Has anyone really complained about the size of Solr?  One of the things I
 like about Solr is how simple it is to get things up and running, and how
 accessible the example directory makes everything.  When I first played with
 DIH and Cell, everything was there.  I didn't have to chase down .jars from
 multiple places.  Maybe if it was as simple as ant build-example-cell and
 ant build-example-dih then there might not be a barrier to entry for new
 users.

 I'd be curious to hear what percentage of folks deploy solr based on the
 example app, and how many just start out with the most stripped down
 solr.war and build everything up from there?

 Eric




 On Jun 10, 2009, at 6:56 PM, Grant Ingersoll wrote:



 +1

 Should be easy enough to conjure up the Ant magic.

 On Jun 10, 2009, at 5:55 PM, Yonik Seeley wrote:



 Thanks for bringing this up Patrick... clearly it would be nice to
 avoid (or mandate) 100MB downloads!

 -Yonik
 http://www.lucidimagination.com


 On Wed, Jun 10, 2009 at 5:50 PM, patrick o'leary pj...@pjaol.com
 wrote:


 Just using the apache-solr example directory, it seems to have gotten a
 bit big, e.g.:

 $ du -sh *
 13M     apache-solr-1.3.0
 92M     apache-solr-1.4.0

 The biggest space user being example-DIH

 apache-solr-1.4.0/example
 $ du -sh *
 4.0K    README.txt
 5.5M    clustering
 80K     etc
 *32M     example-DIH*
 42K     exampleAnalysis
 168K    exampledocs
 13M     lib
 52K     logs
 118K    multicore
 *31M     solr*
 20K     start.jar
 12M     webapps
 12K     work

 solr/lib is now 30MB
 apache-solr-1.4.0/example/solr/lib
 $ ls -lhS
 total 30M
 -rwx-- 1 pjaol None  14M Jun 10 17:08 ooxml-schemas-1.0.jar
 -rwx-- 1 pjaol None 4.3M Jun 10 17:08 icu4j-3.8.jar
 -rwx-- 1 pjaol None 3.2M Jun 10 17:08 pdfbox-0.7.3.jar
 -rwx-- 1 pjaol None 2.6M Jun 10 17:08 xmlbeans-2.3.0.jar
 -rwx-- 1 pjaol None 1.5M Jun 10 17:08 poi-3.5-beta5.jar
 -rwx-- 1 pjaol None 1.2M Jun 10 17:08 xercesImpl-2.8.1.jar
 -rwx-- 1 pjaol None 1.1M Jun 10 17:08 bcprov-jdk14-132.jar

 as opposed to 0 for 1.3.0

 This pushes Solr to over a 100MB download for features that I'm sure can be
 packaged up separately, as they look seldom used.
 It would make sense, if there's going to be a batteries-included version, to
 also have a solr-lite version.

 P



 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
 using
 Solr/Lucene:
 http://www.lucidimagination.com/search



 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
 http://www.opensourceconnections.com
 Free/Busy: http://tinyurl.com/eric-cal


[jira] Commented: (SOLR-1150) OutofMemoryError on enabling highlighting

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719597#action_12719597
 ] 

Mark Miller commented on SOLR-1150:
---

Odd - the change looks like it wouldn't affect this, but somehow the highlighter
test fails as it attempts to access a deleted doc. Not quite sure what is up
yet.

 OutofMemoryError on enabling highlighting
 -

 Key: SOLR-1150
 URL: https://issues.apache.org/jira/browse/SOLR-1150
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Siddharth Gargate
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: SOLR-1150.patch


 Please refer to the following mail thread:
 http://markmail.org/message/5nhkm5h3ongqlput
 I am testing with a 2 MB document size and just 500 documents. Indexing
 works fine even with a 128 MB heap, but on searching Solr throws an OOM
 error. This issue is observed only when highlighting is enabled. While
 indexing I store 1 MB of text; while searching, Solr reads all 500
 documents into memory, including the complete 1 MB stored field for each.
 As a result, 500 docs * 1 MB * 2 (2 bytes per char) = 1000 MB of memory
 is required for searching.
 This memory usage can be reduced by reading one document at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1150) OutofMemoryError on enabling highlighting

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719598#action_12719598
 ] 

Mark Miller commented on SOLR-1150:
---

There is a problem with distrib and highlighting as well.

 OutofMemoryError on enabling highlighting
 -

 Key: SOLR-1150
 URL: https://issues.apache.org/jira/browse/SOLR-1150
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Siddharth Gargate
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: SOLR-1150.patch


 Please refer to the following mail thread:
 http://markmail.org/message/5nhkm5h3ongqlput
 I am testing with a 2 MB document size and just 500 documents. Indexing
 works fine even with a 128 MB heap, but on searching Solr throws an OOM
 error. This issue is observed only when highlighting is enabled. While
 indexing I store 1 MB of text; while searching, Solr reads all 500
 documents into memory, including the complete 1 MB stored field for each.
 As a result, 500 docs * 1 MB * 2 (2 bytes per char) = 1000 MB of memory
 is required for searching.
 This memory usage can be reduced by reading one document at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1150) OutofMemoryError on enabling highlighting

2009-06-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719599#action_12719599
 ] 

Yonik Seeley commented on SOLR-1150:


It's trying to read the loop iterator (i.e. 0-9)  ;-)

 OutofMemoryError on enabling highlighting
 -

 Key: SOLR-1150
 URL: https://issues.apache.org/jira/browse/SOLR-1150
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Siddharth Gargate
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: SOLR-1150.patch


 Please refer to the following mail thread:
 http://markmail.org/message/5nhkm5h3ongqlput
 I am testing with a 2 MB document size and just 500 documents. Indexing
 works fine even with a 128 MB heap, but on searching Solr throws an OOM
 error. This issue is observed only when highlighting is enabled. While
 indexing I store 1 MB of text; while searching, Solr reads all 500
 documents into memory, including the complete 1 MB stored field for each.
 As a result, 500 docs * 1 MB * 2 (2 bytes per char) = 1000 MB of memory
 is required for searching.
 This memory usage can be reduced by reading one document at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1150) OutofMemoryError on enabling highlighting

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719601#action_12719601
 ] 

Mark Miller commented on SOLR-1150:
---

Ah, thanks for the spot, Yonik.

I'll switch it to use doc ids instead and see how things go :)

 OutofMemoryError on enabling highlighting
 -

 Key: SOLR-1150
 URL: https://issues.apache.org/jira/browse/SOLR-1150
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Siddharth Gargate
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: SOLR-1150.patch


 Please refer to following mail thread
 http://markmail.org/message/5nhkm5h3ongqlput
 I am testing with 2MB document size and just 500 documents. Indexing is 
 working fine even with 128MB heap size. But on searching Solr throws OOM 
 error. This issue is observed only when we enable highlighting. While 
 indexing I am storing 1 MB text. While searching Solr reads all the 500 
 documents in the memory. It also reads the complete 1 MB stored field in the 
 memory for all 500 documents. Due to this 500 docs * 1 MB * 2 (2 bytes per 
 char) = 1000 MB memory is required for searching.
 This memory usage can be reduced by reading one document at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719609#action_12719609
 ] 

Walter Underwood commented on SOLR-1216:


'sync' is a weak name, because it doesn't say whether it is a push or a pull
synchronization.


 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as
 snappull, snapshot, etc. This is a vestige of the script-based replication we
 currently have. The commands could be renamed to make more sense:
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719620#action_12719620
 ] 

Mark Miller commented on SOLR-1216:
---

That's why I was torn between it and syncFromMaster.

Not in love with that either, though. Any suggestions?

 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as
 snappull, snapshot, etc. This is a vestige of the script-based replication we
 currently have. The commands could be renamed to make more sense:
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719625#action_12719625
 ] 

Walter Underwood commented on SOLR-1216:


If we choose a name for the thing we are pulling, like 'image', then we can use
makeimage, pullimage, etc.


 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as
 snappull, snapshot, etc. This is a vestige of the script-based replication we
 currently have. The commands could be renamed to make more sense:
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1150) OutofMemoryError on enabling highlighting

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-1150:
--

Attachment: SOLR-1150.patch

 OutofMemoryError on enabling highlighting
 -

 Key: SOLR-1150
 URL: https://issues.apache.org/jira/browse/SOLR-1150
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Siddharth Gargate
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: SOLR-1150.patch, SOLR-1150.patch


 Please refer to the following mail thread:
 http://markmail.org/message/5nhkm5h3ongqlput
 I am testing with a 2 MB document size and just 500 documents. Indexing
 works fine even with a 128 MB heap, but on searching Solr throws an OOM
 error. This issue is observed only when highlighting is enabled. While
 indexing I store 1 MB of text; while searching, Solr reads all 500
 documents into memory, including the complete 1 MB stored field for each.
 As a result, 500 docs * 1 MB * 2 (2 bytes per char) = 1000 MB of memory
 is required for searching.
 This memory usage can be reduced by reading one document at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-236) Field collapsing

2009-06-15 Thread Shekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719677#action_12719677
 ] 

Shekhar commented on SOLR-236:
--

Here is the solrconfig file:


<requestHandler name="geo" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>

  <arr name="components">
    <str>localsolr</str>
    <str>collapse</str>
  </arr>
</requestHandler>


Following are the results I am getting:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">146</int>
<lst name="params">
<str name="lat">41.883784</str>
<str name="radius">50</str>
<str name="collapse.field">resource_id</str>
<str name="rows">2</str>
<str name="indent">on</str>
<str name="fl">resource_id,geo_distance</str>
<str name="q">TV</str>
<str name="qt">geo</str>
<str name="long">-87.637668</str>
</lst>
</lst>
<result name="response" numFound="4294" start="0">
<doc>
<int name="resource_id">10018</int>
<double name="geo_distance">26.16691883965225</double>
</doc>
<doc>
<int name="resource_id">10102</int>
<double name="geo_distance">39.90588996589528</double>
</doc>
</result>
<lst name="collapse_counts">
<str name="field">resource_id</str>
<lst name="doc">
<int name="10022">116</int>
<int name="11701">4</int>
</lst>
<lst name="count">
<int name="10015">116</int>
<int name="10018">4</int>
</lst>
<lst name="debug">
<str name="Docset type">BitDocSet(5201)</str>
<long name="Total collapsing time(ms)">46</long>
<long name="Create uncollapsed docset(ms)">22</long>
<long name="Collapsing normal time(ms)">24</long>
<long name="Creating collapseinfo time(ms)">0</long>
<long name="Convert to bitset time(ms)">0</long>
<long name="Create collapsed docset time(ms)">0</long>
</lst>
</lst>
<result name="response" numFound="5201" start="0">
<doc>
<int name="resource_id">10015</int>
</doc>
<doc>
<int name="resource_id">10018</int>
</doc>
</result>
</response>

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
 Fix For: 1.5

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-solr-236-2.patch, 
 field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, 
 field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 It is used to collapse a group of results with a similar value for a given
 field into a single entry in the result set. Site collapsing is a special case
 of this, where all results for a given web site are collapsed into one or two
 entries in the result set, typically with an associated 'more documents from
 this site' link. See also Duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 * collapse.field to choose the field used to group results
 * collapse.type: normal (default value) or adjacent
 * collapse.max to select how many continuous results are allowed before
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for the current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling corrections are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
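
For reference, a hedged SolrJ sketch of issuing a collapsing query against a
handler like the "geo" one above, assuming the SOLR-236 patch is installed
and a Solr 1.3/1.4-era SolrJ client; the URL and field names are placeholders
taken from the output above, not a verified recipe:

{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseQuerySketch {
  public static void main(String[] args) throws Exception {
    // placeholder URL; point this at the core that registers the "geo" handler
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("TV");
    q.setQueryType("geo");                  // the request handler shown above
    q.set("collapse.field", "resource_id"); // field used to group results
    q.set("collapse.type", "normal");       // or "adjacent" (see SOLR-236)
    q.setRows(2);

    QueryResponse rsp = server.query(q);
    System.out.println("numFound: " + rsp.getResults().getNumFound());
  }
}
{code}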



[jira] Issue Comment Edited: (SOLR-236) Field collapsing

2009-06-15 Thread Shekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719677#action_12719677
 ] 

Shekhar edited comment on SOLR-236 at 6/15/09 3:34 PM:
---

Here is the solrconfig file:


<requestHandler name="geo" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>

  <arr name="components">
    <str>localsolr</str>
    <str>collapse</str>
  </arr>
</requestHandler>

You can get more details from http://www.gissearch.com/localsolr


===

Following are the results I am getting:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">146</int>
<lst name="params">
<str name="lat">41.883784</str>
<str name="radius">50</str>
<str name="collapse.field">resource_id</str>
<str name="rows">2</str>
<str name="indent">on</str>
<str name="fl">resource_id,geo_distance</str>
<str name="q">TV</str>
<str name="qt">geo</str>
<str name="long">-87.637668</str>
</lst>
</lst>
<result name="response" numFound="4294" start="0">
<doc>
<int name="resource_id">10018</int>
<double name="geo_distance">26.16691883965225</double>
</doc>
<doc>
<int name="resource_id">10102</int>
<double name="geo_distance">39.90588996589528</double>
</doc>
</result>
<lst name="collapse_counts">
<str name="field">resource_id</str>
<lst name="doc">
<int name="10022">116</int>
<int name="11701">4</int>
</lst>
<lst name="count">
<int name="10015">116</int>
<int name="10018">4</int>
</lst>
<lst name="debug">
<str name="Docset type">BitDocSet(5201)</str>
<long name="Total collapsing time(ms)">46</long>
<long name="Create uncollapsed docset(ms)">22</long>
<long name="Collapsing normal time(ms)">24</long>
<long name="Creating collapseinfo time(ms)">0</long>
<long name="Convert to bitset time(ms)">0</long>
<long name="Create collapsed docset time(ms)">0</long>
</lst>
</lst>
<result name="response" numFound="5201" start="0">
<doc>
<int name="resource_id">10015</int>
</doc>
<doc>
<int name="resource_id">10018</int>
</doc>
</result>
</response>

  was (Author: csnirkhe):
Here is the solrconfig file:


<requestHandler name="geo" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>

  <arr name="components">
    <str>localsolr</str>
    <str>collapse</str>
  </arr>
</requestHandler>


Following are the results I am getting:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">146</int>
<lst name="params">
<str name="lat">41.883784</str>
<str name="radius">50</str>
<str name="collapse.field">resource_id</str>
<str name="rows">2</str>
<str name="indent">on</str>
<str name="fl">resource_id,geo_distance</str>
<str name="q">TV</str>
<str name="qt">geo</str>
<str name="long">-87.637668</str>
</lst>
</lst>
<result name="response" numFound="4294" start="0">
<doc>
<int name="resource_id">10018</int>
<double name="geo_distance">26.16691883965225</double>
</doc>
<doc>
<int name="resource_id">10102</int>
<double name="geo_distance">39.90588996589528</double>
</doc>
</result>
<lst name="collapse_counts">
<str name="field">resource_id</str>
<lst name="doc">
<int name="10022">116</int>
<int name="11701">4</int>
</lst>
<lst name="count">
<int name="10015">116</int>
<int name="10018">4</int>
</lst>
<lst name="debug">
<str name="Docset type">BitDocSet(5201)</str>
<long name="Total collapsing time(ms)">46</long>
<long name="Create uncollapsed docset(ms)">22</long>
<long name="Collapsing normal time(ms)">24</long>
<long name="Creating collapseinfo time(ms)">0</long>
<long name="Convert to bitset time(ms)">0</long>
<long name="Create collapsed docset time(ms)">0</long>
</lst>
</lst>
<result name="response" numFound="5201" start="0">
<doc>
<int name="resource_id">10015</int>
</doc>
<doc>
<int name="resource_id">10018</int>
</doc>
</result>
</response>
  
 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
 Fix For: 1.5

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-solr-236-2.patch, 
 field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, 
 field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 It is used to collapse a group of results with a similar value for a given
 field into a single entry in the result set. Site collapsing is a special case
 of this, where all results for a given web site are collapsed into one or two
 entries in the result set, typically with an associated 'more documents from
 this site' link. See also Duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new

[jira] Assigned: (SOLR-1219) use setproxy ant task when proxy properties are specified

2009-06-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned SOLR-1219:


Assignee: Koji Sekiguchi

I'll commit soon.

 use setproxy ant task when proxy properties are specified
 -

 Key: SOLR-1219
 URL: https://issues.apache.org/jira/browse/SOLR-1219
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.4

 Attachments: SOLR-1219.patch


 Currently, ant luke and ant example will fail if you use a proxy:
 {code}
 $ ant luke
 build.xml:881: HTTP Authorization failure
 {code}
 To avoid this, use the setproxy ant task when proxy properties are specified
 by the user:
 {code}
 $ ant luke -Dproxy.host=hostname -Dproxy.port=8080 -Dproxy.user=user 
 -Dproxy.password=passwd
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1219) use setproxy ant task when proxy properties are specified

2009-06-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1219:
-

Fix Version/s: 1.4

 use setproxy ant task when proxy properties are specified
 -

 Key: SOLR-1219
 URL: https://issues.apache.org/jira/browse/SOLR-1219
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.4

 Attachments: SOLR-1219.patch


 Currently, ant luke and ant example will fail if you use a proxy:
 {code}
 $ ant luke
 build.xml:881: HTTP Authorization failure
 {code}
 To avoid this, use the setproxy ant task when proxy properties are specified
 by the user:
 {code}
 $ ant luke -Dproxy.host=hostname -Dproxy.port=8080 -Dproxy.user=user 
 -Dproxy.password=passwd
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1219) use setproxy ant task when proxy properties are specified

2009-06-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved SOLR-1219.
--

Resolution: Fixed

Committed revision 785029.

 use setproxy ant task when proxy properties are specified
 -

 Key: SOLR-1219
 URL: https://issues.apache.org/jira/browse/SOLR-1219
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.4

 Attachments: SOLR-1219.patch


 Currently, ant luke and ant example will fail if you use a proxy:
 {code}
 $ ant luke
 build.xml:881: HTTP Authorization failure
 {code}
 To avoid this, use the setproxy ant task when proxy properties are specified
 by the user:
 {code}
 $ ant luke -Dproxy.host=hostname -Dproxy.port=8080 -Dproxy.user=user 
 -Dproxy.password=passwd
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1220) UnInvertedField performance improvement on fields with an extremely large number of values

2009-06-15 Thread Kent Fitch (JIRA)
UnInvertedField performance improvement on fields with an extremely large 
number of values
--

 Key: SOLR-1220
 URL: https://issues.apache.org/jira/browse/SOLR-1220
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Kent Fitch
Priority: Minor


Our setup is:

- about 34M lucene documents of bibliographic and full text content
- index currently 115GB, will at least double over next 6 months
- moving to support real-time-ish updates (maybe 5 min delay)

We facet on 8 fields, 6 of which are normal with small numbers of
distinct values.  But 2 faceted fields, creator and subject, are huge,
with 18M and 9M terms respectively.

On a server with 2x quad core AMD 2382 processors and 64GB memory, java
1.6.0_13-b03, 64 bit, run with -Xmx15192M -Xms6000M -verbose:gc, and with
the index on an Intel X25M SSD, on start-up the elapsed time to create
the 8 facets is 306 seconds (best time).  Following an index reopen,
the time to recreate them is 318 seconds (best time).

[We have made an independent experimental change to create the facets
with 3 async threads, that is, in parallel, and also to decouple them
from the underlying index, so our facets lag the index changes by the
time to recreate the facets.  With our setup, the 3 threads reduced
facet creation elapsed time from about 450 secs to around 320 secs,
but this will depend a lot on the IO capabilities of the device containing
the index, the amount of file system caching, load, etc.]

Anyway, we noticed that huge amounts of garbage were being collected
during facet generation of the creator and subject fields, and tracked
it down to this decision in UnInvertedField.uninvert():

 if (termNum >= maxTermCounts.length) {
   // resize, but conserve memory by not doubling
   // resize at end??? we waste a maximum of 16K (average of 8K)
   int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
   System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
   maxTermCounts = newMaxTermCounts;
 }

So, we tried the obvious thing:

- allocate 10K terms initially, rather than 1K
- extend by doubling the current size, rather than adding a fixed 4K
- free unused space at the end (but only if the unused space is
significant) by reallocating the array to the exact required size

And also:

- created a static HashMap lookup keyed on field name which remembers
the previously allocated size for maxTermCounts for that field, and
initially allocates that size + 1000 entries

The second change is a minor optimisation, but the first change, by
eliminating thousands of array reallocations and copies, greatly
improved load times, down from 306 to 124 seconds on the initial load
and from 318 to 134 seconds on reloads after index updates.  About
60-70 secs is still spent in GC, but it is a significant improvement.

Unless you have very large numbers of facet values, this change won't
have any positive benefit.

The core part of our change is reflected by this diff against revision 785058:

***
*** 222,232 ***

    int termNum = te.getTermNumber();

    if (termNum >= maxTermCounts.length) {
!     // resize, but conserve memory by not doubling
!     // resize at end??? we waste a maximum of 16K (average of 8K)
!     int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
      System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
      maxTermCounts = newMaxTermCounts;
    }

--- 222,232 ---

    int termNum = te.getTermNumber();

    if (termNum >= maxTermCounts.length) {
!     // resize by doubling - for very large numbers of unique terms,
!     // expanding by 4K and the resultant GC will dominate uninvert times.
!     // Resize at end if material.
!     int[] newMaxTermCounts = new int[maxTermCounts.length*2];
      System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
      maxTermCounts = newMaxTermCounts;
    }

***
*** 331,338 ***
--- 331,346 ---

    numTermsInField = te.getTermNumber();
    te.close();

+   // free space if outrageously wasteful (tradeoff memory/cpu)
+
+   if ((maxTermCounts.length - numTermsInField) > 1024) { // too much waste!
+     int[] newMaxTermCounts = new int[numTermsInField];
+     System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, numTermsInField);
+     maxTermCounts = newMaxTermCounts;
+   }
+
    long midPoint = System.currentTimeMillis();

    if (termInstances == 0) {
      // we didn't invert anything



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
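
As an aside, a minimal self-contained sketch of the grow-by-doubling-then-trim
policy the diff above implements; the method and variable names here are
illustrative, not taken from UnInvertedField:

{code}
public class GrowByDoublingSketch {
  // Grow by doubling so N appends cost O(N) copies in total, instead of the
  // roughly N^2/4096 copies implied by fixed 4K increments.
  static int[] ensureCapacity(int[] counts, int termNum) {
    if (termNum >= counts.length) {
      int[] grown = new int[Math.max(counts.length * 2, termNum + 1)];
      System.arraycopy(counts, 0, grown, 0, counts.length);
      counts = grown;
    }
    return counts;
  }

  // After the last term, give back the slack, but only when it is large
  // enough to matter (the memory/cpu tradeoff noted in the diff).
  static int[] trimToSize(int[] counts, int numTerms) {
    if (counts.length - numTerms > 1024) {
      int[] exact = new int[numTerms];
      System.arraycopy(counts, 0, exact, 0, numTerms);
      counts = exact;
    }
    return counts;
  }
}
{code}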



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719920#action_12719920
 ] 

Shalin Shekhar Mangar commented on SOLR-1216:
-

bq. If we choose a name for the thing we are pulling, like image, then we can 
use makeimage, pullimage, etc. 

How about pullIndex?

 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as
 snappull, snapshot, etc. This is a vestige of the script-based replication we
 currently have. The commands could be renamed to make more sense:
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719922#action_12719922
 ] 

Noble Paul commented on SOLR-1216:
--

'image' gives the same idea as 'snapshot': it suggests that an image of the index
exists.

How about 'fetchIndex' and 'abortfetch'?

 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as
 snappull, snapshot, etc. This is a vestige of the script-based replication we
 currently have. The commands could be renamed to make more sense:
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1220) UnInvertedField performance improvement on fields with an extremely large number of values

2009-06-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719924#action_12719924
 ] 

Yonik Seeley commented on SOLR-1220:


Thanks Kent, but the patch was mangled by JIRA.
The normal procedure is to do an 'svn diff > SOLR-NNN.patch' and attach that
file to the issue via Attach File.
That also allows you to click the 'Grant license to ASF' button to help us with
our intellectual property tracking.

 UnInvertedField performance improvement on fields with an extremely large 
 number of values
 --

 Key: SOLR-1220
 URL: https://issues.apache.org/jira/browse/SOLR-1220
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Kent Fitch
Priority: Minor

 Our setup is:
 - about 34M lucene documents of bibliographic and full text content
 - index currently 115GB, will at least double over next 6 months
 - moving to support real-time-ish updates (maybe 5 min delay)
 We facet on 8 fields, 6 of which are normal with small numbers of
 distinct values.  But 2 faceted fields, creator and subject, are huge,
 with 18M and 9M terms respectively.
 On a server with 2x quad core AMD 2382 processors and 64GB memory, java
 1.6.0_13-b03, 64 bit, run with -Xmx15192M -Xms6000M -verbose:gc, and with
 the index on an Intel X25M SSD, on start-up the elapsed time to create
 the 8 facets is 306 seconds (best time).  Following an index reopen,
 the time to recreate them is 318 seconds (best time).
 [We have made an independent experimental change to create the facets
 with 3 async threads, that is, in parallel, and also to decouple them
 from the underlying index, so our facets lag the index changes by the
 time to recreate the facets.  With our setup, the 3 threads reduced
 facet creation elapsed time from about 450 secs to around 320 secs,
 but this will depend a lot on the IO capabilities of the device containing
 the index, the amount of file system caching, load, etc.]
 Anyway, we noticed that huge amounts of garbage were being collected
 during facet generation of the creator and subject fields, and tracked
 it down to this decision in UnInvertedField.uninvert():
   if (termNum >= maxTermCounts.length) {
     // resize, but conserve memory by not doubling
     // resize at end??? we waste a maximum of 16K (average of 8K)
     int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
     System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
     maxTermCounts = newMaxTermCounts;
   }
 So, we tried the obvious thing:
 - allocate 10K terms initially, rather than 1K
 - extend by doubling the current size, rather than adding a fixed 4K
 - free unused space at the end (but only if the unused space is
 significant) by reallocating the array to the exact required size
 And also:
 - created a static HashMap lookup keyed on field name which remembers
 the previously allocated size for maxTermCounts for that field, and
 initially allocates that size + 1000 entries
 The second change is a minor optimisation, but the first change, by
 eliminating thousands of array reallocations and copies, greatly
 improved load times, down from 306 to 124 seconds on the initial load
 and from 318 to 134 seconds on reloads after index updates.  About
 60-70 secs is still spent in GC, but it is a significant improvement.
 Unless you have very large numbers of facet values, this change won't
 have any positive benefit.
 The core part of our change is reflected by this diff against revision 785058:
 ***
 *** 222,232 ***
 int termNum = te.getTermNumber();
 if (termNum >= maxTermCounts.length) {
 !   // resize, but conserve memory by not doubling
 !   // resize at end??? we waste a maximum of 16K (average of 8K)
 !   int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
     System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
     maxTermCounts = newMaxTermCounts;
 }
 --- 222,232 ---
 int termNum = te.getTermNumber();
 if (termNum >= maxTermCounts.length) {
 !   // resize by doubling - for very large numbers of unique terms,
 !   // expanding by 4K and the resultant GC will dominate uninvert times.
 !   // Resize at end if material.
 !   int[] newMaxTermCounts = new int[maxTermCounts.length*2];
     System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
     maxTermCounts = newMaxTermCounts;
 }
 ***
 *** 331,338 ***
 --- 331,346 ---
   numTermsInField = te.getTermNumber();
   te.close();
 + // free space if outrageously wasteful (tradeoff memory/cpu)
 +
 + if ((maxTermCounts.length - numTermsInField) > 1024) { // too much waste!
 +   int[]