[jira] Updated: (SOLR-1709) Distributed Date Faceting

2011-02-14 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-1709:
---

Attachment: SOLR-1709_distributed_date_faceting_v3x.patch

This is a patch for v3.1.  It includes a test. Thanks to Solr's excellent test 
infrastructure, it was actually very easy to test.

Now that said, I suspect that facet.date is going to be deprecated in favor of 
facet.range (which supports dates and numbers) -- SOLR-1240. Facet.range does 
not support distributed faceting yet. I'll take a look at porting this patch and 
providing a test.
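
The merge rule described in the quoted issue below amounts to roughly the following 
(a minimal sketch using plain java.util maps; the patch itself works on Solr's 
SimpleOrderedMap inside FacetComponent, and the variable names here are illustrative):

{code:java}
// Counts from later shards are added into the date buckets established by the
// first shard's facet_dates; buckets the first shard doesn't know about
// (e.g. shards skewed by more than one gap) are simply not merged in.
Map<String,Long> merged = new LinkedHashMap<String,Long>(firstShardFacetDates);
for (Map<String,Long> shardCounts : otherShardsFacetDates) {
  for (Map.Entry<String,Long> bucket : shardCounts.entrySet()) {
    Long current = merged.get(bucket.getKey());
    if (current != null) {
      merged.put(bucket.getKey(), current + bucket.getValue());
    }
  }
}
{code}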

> Distributed Date Faceting
> -
>
> Key: SOLR-1709
> URL: https://issues.apache.org/jira/browse/SOLR-1709
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 1.4
>Reporter: Peter Sturge
>Priority: Minor
> Attachments: FacetComponent.java, FacetComponent.java, 
> ResponseBuilder.java, SOLR-1709_distributed_date_faceting_v3x.patch, 
> solr-1.4.0-solr-1709.patch
>
>
> This patch is for adding support for date facets when using distributed 
> searches.
> Date faceting across multiple machines exposes some time-based issues that 
> anyone interested in this behaviour should be aware of:
> Any time and/or time-zone differences are not accounted for in the patch 
> (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
> 'instant-in-time', unless all shards are time-synced to the exact same time).
> The implementation uses the first encountered shard's facet_dates as the 
> basis for subsequent shards' data to be merged in.
> This means that if subsequent shards' facet_dates are skewed in relation to 
> the first by >1 'gap', these 'earlier' or 'later' facets will not be merged 
> in.
> There are several reasons for this:
>   * Performance: It's faster to check facet_date lists against a single map's 
> data, rather than against each other, particularly if there are many shards
>   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
> time range larger than that which was requested
> (e.g. a request for one hour's worth of facets could bring back 2, 3 
> or more hours of data)
> This could be dealt with if timezone and skew information was added, and 
> the dates were normalized.
> One possibility for adding such support is to [optionally] add 'timezone' and 
> 'now' parameters to the 'facet_dates' map. This would tell requesters what 
> time and TZ the remote server thinks it is, and so multiple shards' time data 
> can be normalized.
> The patch affects 2 files in the Solr core:
>   org.apache.solr.handler.component.FacetComponent.java
>   org.apache.solr.handler.component.ResponseBuilder.java
> The main changes are in FacetComponent - ResponseBuilder is just to hold the 
> completed SimpleOrderedMap until the finishStage.
> One possible enhancement is to perhaps make this an optional parameter, but 
> really, if facet.date parameters are specified, it is assumed they are 
> desired.
> Comments & suggestions welcome.
> As a favour to ask, if anyone could take my 2 source files and create a PATCH 
> file from it, it would be greatly appreciated, as I'm having a bit of trouble 
> with svn (don't shoot me, but my environment is a Redmond-based os company).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1240) Numerical Range faceting

2011-02-14 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994661#comment-12994661
 ] 

David Smiley commented on SOLR-1240:


Two comments:
1. I think we should let it be known that "facet.date" is deprecated in 3.1.  
That way it can be removed in a future release without waiting yet another 
release.
2. I think it's very odd that the default for the include parameter is for both 
edges to be inclusive. This means double-counting! Yes, that's how it used to 
work, but I argue it never should have worked that way and I don't think anyone 
is actually depending on this behavior. So backwards-compatibility is moot. I 
propose "lower" be the default.

> Numerical Range faceting
> 
>
> Key: SOLR-1240
> URL: https://issues.apache.org/jira/browse/SOLR-1240
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Gijs Kunze
>Assignee: Hoss Man
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.patch, 
> SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.patch, 
> SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.use-nl.patch
>
>
> For faceting numerical ranges, using many facet.query arguments leads to 
> unmanageably large queries as the number of fields you facet over increases. Adding 
> the same faceting parameter for numbers that already exists for dates should fix 
> this.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1969) Make MMapDirectory configurable in solrconfig.xml

2011-02-14 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994659#comment-12994659
 ] 

Bill Bell commented on SOLR-1969:
-

+1 vote.
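
The behaviour described in the quoted issue below can be pictured in plain Lucene 3.x 
API terms (a sketch only; the patch's actual solrconfig.xml element names are not 
reproduced in this message):

{code:java}
// Classes are from org.apache.lucene.store; "path" is the index directory.
MMapDirectory mmapDir = new MMapDirectory(new File(path));
mmapDir.setUseUnmap(true);                                // unmap files once they are closed
Set<String> nonMappedExtensions =
    new HashSet<String>(Arrays.asList("fdt", "fdx"));     // stored-fields files stay un-mapped
Directory dir = new FileSwitchDirectory(nonMappedExtensions,
    new NIOFSDirectory(new File(path)),                   // primary: serves .fdt/.fdx without mmap
    mmapDir,                                              // secondary: everything else memory-mapped
    true);
{code}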

> Make MMapDirectory configurable in solrconfig.xml
> -
>
> Key: SOLR-1969
> URL: https://issues.apache.org/jira/browse/SOLR-1969
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Stephen Bochinski
> Attachments: mmap_upd.patch, mmap_upd.patch, mmap_upd.txt, 
> mmap_upd.txt
>
>   Original Estimate: 102.5h
>  Remaining Estimate: 102.5h
>
> This makes it possible to enable MMapDirectory from the solrconfig.xml file. 
> There are also several settings you can specify in solrconfig.xml. You can 
> enable or disable the unmapping of files which have been closed by Solr; this 
> is almost a necessity for an index which is being optimized. You also have the 
> option to not mmap certain files, in which case FSDirectory will be used to 
> manage those particular files. This is particularly useful if you are using 
> the FieldCache (SOLR-1961): having it enabled makes it pointless to memory map 
> the .fdt and .fdx files, considering they are already in memory.
> The configuration is specified in solrconfig.xml; the XML snippet was stripped 
> from this message, leaving only the element values (true, false, false).
> This would enable unmapping of closed files and would not memory map files 
> ending with .fdt and .fdx.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-14 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2155:


Attachment: (was: SOLR.2155.p2.patch)
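
The quoted description below rests on the geohash prefix property: a point lies inside 
a grid cell exactly when its geohash starts with that cell's geohash, which is what 
lets coarse prefixes skip whole regions of the term index. A toy illustration (plain 
Java, not the patch; the geohash strings are made up):

{code:java}
String gridCell = "9xj6";                               // a cell covering part of the query shape
String indexedPoint = "9xj64kfp";                       // geohash of an indexed point
boolean insideCell = indexedPoint.startsWith(gridCell); // true: the point lies inside the cell
{code}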

> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch, SOLR.2155.p3.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (e.g. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-14 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994647#comment-12994647
 ] 

Bill Bell edited comment on SOLR-2155 at 2/15/11 3:48 AM:
--

This is the patch with some speed improvements. 

Example call:

{code}

http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfieldmulti=storemv&pt=43.17614,-90.57341&d=100&sfield=store&sort=geomultidist%28%29%20asc&sfieldmultidir=asc

{code}

This addresses/fixes:

3. Use DistanceUtils for hsin
4. Remove split() to improve performance

  was (Author: billnbell):
This is the patch with some speed improvements. 

Example call:

{code}

http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfieldmulti=storemv&pt=43.17614,-90.57341&d=100&sfield=store&sort=geomultidist%28%29%20asc&sfieldmultidir=asc

{code}
  
> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch, SOLR.2155.p2.patch, SOLR.2155.p3.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (e.g. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-14 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2155:


Attachment: SOLR.2155.p3.patch

This is the patch with some speed improvements. 

Example call:

http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfieldmulti=storemv&pt=43.17614,-90.57341&d=100&sfield=store&sort=geomultidist%28%29%20asc&sfieldmultidir=asc



> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch, SOLR.2155.p2.patch, SOLR.2155.p3.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (e.g. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-14 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994647#comment-12994647
 ] 

Bill Bell edited comment on SOLR-2155 at 2/15/11 3:47 AM:
--

This is the patch with some speed improvements. 

Example call:

{code}

http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfieldmulti=storemv&pt=43.17614,-90.57341&d=100&sfield=store&sort=geomultidist%28%29%20asc&sfieldmultidir=asc

{code}

  was (Author: billnbell):
This is the patch with some speed improvements. 

Example call:

http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfieldmulti=storemv&pt=43.17614,-90.57341&d=100&sfield=store&sort=geomultidist%28%29%20asc&sfieldmultidir=asc


  
> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch, SOLR.2155.p2.patch, SOLR.2155.p3.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (e.g. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2919) IndexSplitter that divides by primary key term

2011-02-14 Thread Jason Rutherglen (JIRA)
IndexSplitter that divides by primary key term
--

 Key: LUCENE-2919
 URL: https://issues.apache.org/jira/browse/LUCENE-2919
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor


Index splitter that divides by primary key term.  The contrib 
MultiPassIndexSplitter we have divides by docid, however to guarantee external 
constraints it's sometimes necessary to split by a primary key term id.  I 
think this implementation is a fairly trivial change.
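
One way to picture the idea (a sketch against assumed 3.x-style APIs, not the proposed 
tool; the "id" field and the "m" boundary are illustrative): build each split by copying 
the whole index and then deleting every document whose primary-key term falls outside 
that split's range.

{code:java}
Directory source = FSDirectory.open(new File("/path/to/index"));
Directory split = FSDirectory.open(new File("/path/to/split1"));
IndexWriter w = new IndexWriter(split,
    new IndexWriterConfig(Version.LUCENE_31, new KeywordAnalyzer()));
w.addIndexes(source);                                 // start from a full copy of the index
w.deleteDocuments(new TermRangeQuery("id", "m", null, true, false)); // drop ids >= "m"
w.commit();
w.close();
{code}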

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1395) Integrate Katta

2011-02-14 Thread JohnWu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994642#comment-12994642
 ] 

JohnWu commented on SOLR-1395:
--

Tomliu:
   Do you mean that ISolrServer uses the katta script to point to the query core home 
and indexDirectory?

   Right now I use a katta build with the patched Solr jar to start the sub-proxy; the 
Solr home is set in the katta script:

   -Dsolr.home=/home/hadoop/workspace/kattaNoZK/solrHome

   In the conf folder of solrHome, solrconfig.xml declares a SearchHandler (the XML 
snippet was stripped from this message; only the "explicit" echoParams value survived), 
so Solr sends the query to this core of solrHome through the SearchHandler (which uses 
the query component and returns the DocSlice).

As you said yesterday, I need to correct the solrHome in the katta script (slave node) 
so that it points to the query core. But how do I configure the sub-proxy Solr home 
with solr.SearchHandler?
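
(For reference, the stripped snippet was presumably a SearchHandler declaration along 
these lines; apart from the surviving "explicit" value, the names here are assumptions:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
)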

Thanks!

JohnWu




> Integrate Katta
> ---
>
> Key: SOLR-1395
> URL: https://issues.apache.org/jira/browse/SOLR-1395
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: Next
>
> Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
> back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
> katta-solrcores.jpg, katta.node.properties, katta.zk.properties, 
> log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
> solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
> solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
> solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
> solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, 
> zkclient-0.1-dev.jar, zookeeper-3.2.1.jar
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We'll integrate Katta into Solr so that:
> * Distributed search uses Hadoop RPC
> * Shard/SolrCore distribution and management
> * Zookeeper based failover
> * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: wind down for 3.1?

2011-02-14 Thread Robert Muir
On Mon, Feb 14, 2011 at 6:04 PM, Grant Ingersoll  wrote:
> I can tell you that I often stop reviewing a patch as soon as I notice it
> doesn't have tests.    In fact, I wish we could get the Hadoop Hudson
> auto-test stuff hooked in so that it would -1 patches that don't have tests.
> So, if it sorely needs to be committed, then it sorely needs tests written.
>

+1, I really wish we had this hooked in. There are just so many things
that it's impossible to stay on top of; these are my 3 biggest pet
peeves coming to mind:

1. Javadocs warnings/errors: this is a constant battle. It's worth
considering whether the build should actually fail if you get one of these;
in my opinion, if we can do this we really should. It's frustrating to
keep the javadocs warnings down to zero, yet we shouldn't have invalid
references etc. in our documentation.
2. Introducing new compiler warnings: another problem just being left
for someone else to clean up later, another constant losing battle.
99% of the time (for non-autogenerated code) the warnings are
useful... in my opinion we should not commit patches that create new
warnings.
3. svn eol-style: I run a script periodically and convert all of this.
It seems a lot of people don't have their svn clients configured to set this
automatically. Please configure your Subversion client according to Apache
standards: http://www.apache.org/dev/svn-eol-style.txt, especially if
you care about being able to see line-by-line history. Otherwise my
script will periodically perform massive changes to the codebase,
including in some cases changing every single line of code in affected
files.
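
For anyone setting this up, the linked page boils down to enabling auto-props in the
Subversion client config (~/.subversion/config); the extensions below are only examples:

[miscellany]
enable-auto-props = yes

[auto-props]
*.java = svn:eol-style=native
*.xml = svn:eol-style=native
*.txt = svn:eol-style=native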

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1969) Make MMapDirectory configurable in solrconfig.xml

2011-02-14 Thread Tim Sturge (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994609#comment-12994609
 ] 

Tim Sturge commented on SOLR-1969:
--

Hi,

Is there a plan to patch this into core SOLR? We found it gave us a 2x-3x 
speedup. Is there something we can help with here?

Secondly, there's a bug in this patch; the directories in:

+   return new FileSwitchDirectory(nonMappedFiles, mmapDir, 
+   FSDirectory.open(new File(path)), true);

are reversed (the set should contain the files to use the primary directory). 
I'm assuming from this that this patch hasn't seen widespread deployment.
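
If that reading is right, the fix is just to swap the two directory arguments, since
FileSwitchDirectory routes the extensions in the set to the primary directory. A sketch
of the corrected call, assuming nonMappedFiles holds the extensions that should bypass
mmap:

return new FileSwitchDirectory(nonMappedFiles,
    FSDirectory.open(new File(path)),  // primary: serves the extensions listed in nonMappedFiles
    mmapDir,                           // secondary: everything else stays memory-mapped
    true);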

Thanks,

Tim


> Make MMapDirectory configurable in solrconfig.xml
> -
>
> Key: SOLR-1969
> URL: https://issues.apache.org/jira/browse/SOLR-1969
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Stephen Bochinski
> Attachments: mmap_upd.patch, mmap_upd.patch, mmap_upd.txt, 
> mmap_upd.txt
>
>   Original Estimate: 102.5h
>  Remaining Estimate: 102.5h
>
> This makes it possible to enable MMapDirectory from the solrconfig.xml file. 
> There are also several settings you can specify in solrconfig.xml. You can 
> enable or disable the unmapping of files which have been closed by Solr; this 
> is almost a necessity for an index which is being optimized. You also have the 
> option to not mmap certain files, in which case FSDirectory will be used to 
> manage those particular files. This is particularly useful if you are using 
> the FieldCache (SOLR-1961): having it enabled makes it pointless to memory map 
> the .fdt and .fdx files, considering they are already in memory.
> The configuration is specified in solrconfig.xml; the XML snippet was stripped 
> from this message, leaving only the element values (true, false, false).
> This would enable unmapping of closed files and would not memory map files 
> ending with .fdt and .fdx.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr-dev mailing list on Nabble

2011-02-14 Thread Yonik Seeley
On Mon, Feb 14, 2011 at 7:32 PM, Mark Miller  wrote:
> I seem to recall an email i ran into a few days back about yonik starting the 
> nabble list...let me look...

Hmmm, IIRC I just sent a request to have the lists archived and
someone from nabble handled it.
At the time I don't think there was anything about admins, etc.
I'll try digging through my mailbox and see what I come up with.

-Yonik
http://lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr-dev mailing list on Nabble

2011-02-14 Thread Mark Miller
I seem to recall an email i ran into a few days back about yonik starting the 
nabble list...let me look...

Here:

> FYI, I also started archiving of solr-user/solr-dev on Nabble
> :
> 
> http://www.nabble.com/Solr-f14479.html
> 
> 
> 
> This is in addition to archives at mail-archive:
> 
> http://www.mail-archive.com/index.php?hunt=solr
> 
> 
> 
> -Yonik



On Feb 14, 2011, at 6:37 PM, Steven A Rowe wrote:

> As I mentioned on the solr-dev mailing list 
> ,
>  David Smiley's responses to emails on dev@l.a.o have been going to 
> solr-dev@l.a.o.  This is a problem, and it's not restricted to David's emails.
> 
> David Smiley's response to me in a private email, in part: "When I reply via 
> Nabble, it uses the old list which is why my emails have been sent to the 
> wrong list."
> 
> I put up a support request on Nabble: 
> http://nabble-support.1.n2.nabble.com/solr-dev-mailing-list-td6023495.html 
> and the only response so far seems to indicate that mailing lists are managed 
> by admins associated with the project with which each mailing list is 
> associated.
> 
> Who is the admin for the Solr and/or Lucene mailing lists on Nabble?
> 
> Steve

- Mark Miller
lucidimagination.com





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2286) Automatically detecting Date/Time format in the DIH

2011-02-14 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-2286:
---

Affects Version/s: (was: 1.4.1)
Fix Version/s: (was: 1.4.1)
   Next

> Automatically detecting Date/Time format in the DIH
> ---
>
> Key: SOLR-2286
> URL: https://issues.apache.org/jira/browse/SOLR-2286
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
> Environment: Windows 7
>Reporter: Adam Estrada
> Fix For: Next
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ingesting several RSS/ATOM feeds, it's very laborious to format the date 
> and time for each feed. I came across a bit of Java code that may or may not 
> help alleviate some of this work.
> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
> I think that this would be a great addition to those of us who ingest a lot 
> of syndicated data and then want to query on it.
> Thanks, 
> Adam
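
The linked DateParser boils down to trying a list of known formats until one parses; a 
minimal sketch of that idea (patterns shown are illustrative, not the Rome code):

{code:java}
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class LenientDateParser {
  private static final String[] PATTERNS = {
      "EEE, dd MMM yyyy HH:mm:ss zzz",   // RFC 822, common in RSS
      "yyyy-MM-dd'T'HH:mm:ss'Z'",        // ISO 8601 / Atom
      "yyyy-MM-dd"
  };

  public static Date parse(String value) {
    for (String pattern : PATTERNS) {
      try {
        return new SimpleDateFormat(pattern, Locale.ENGLISH).parse(value.trim());
      } catch (ParseException ignore) {
        // fall through and try the next pattern
      }
    }
    return null;                          // caller decides how to handle unparseable dates
  }
}
{code}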

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2258) adding FieldsNames by configuration for SignatureUpdateProcessorFactory

2011-02-14 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-2258:
---

Fix Version/s: (was: 1.4)
   Next

> adding FieldsNames by configuration for SignatureUpdateProcessorFactory
> ---
>
> Key: SOLR-2258
> URL: https://issues.apache.org/jira/browse/SOLR-2258
> Project: Solr
>  Issue Type: Wish
>Reporter: Bernd Fehling
>Priority: Trivial
> Fix For: Next
>
> Attachments: SOLR-2258.patch
>
>
> I would like to suggest adding an option, defaulting to true, that includes the 
> configured "fields" names in the signature.
> There are use cases where only the signature of the content is required.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2106) Spelling Checking for Multiple Fields

2011-02-14 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-2106:
---

Affects Version/s: (was: 1.4)
Fix Version/s: (was: 1.4)
   Next
   Issue Type: New Feature  (was: Bug)

> Spelling Checking for Multiple Fields
> -
>
> Key: SOLR-2106
> URL: https://issues.apache.org/jira/browse/SOLR-2106
> Project: Solr
>  Issue Type: New Feature
>  Components: spellchecker
> Environment: Linux Environment
>Reporter: JAYABAALAN V
> Fix For: Next
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Need to enable spellchecking for five different fields and its configuration. I am 
> using the dismax query parser for searching the different fields. If a user enters 
> a wrong spelling in the front end, it should check the five different fields, give 
> a collated spelling suggestion in the front end, and return results based on that 
> suggestion. Please provide the configuration details for this.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1997) analyzed field: Store internal value instead of input one

2011-02-14 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-1997:
---

Affects Version/s: (was: 1.4.1)
   (was: 1.5)
   (was: 1.4)
Fix Version/s: (was: 1.4.1)
   (was: 1.5)
   (was: 1.4)
   Next

> analyzed field: Store internal value instead of input one
> -
>
> Key: SOLR-1997
> URL: https://issues.apache.org/jira/browse/SOLR-1997
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joan Codina
> Fix For: Next
>
> Attachments: SOLR-1997-1.4.patch, SOLR-1997-1.5.patch
>
>
> Solr implements a set of filters and tokenizers that allow the filtering and 
> treatment of text, but when a field is set to be stored, the text stored is the 
> input text. This may be useful when the end user reads the input, but not in other 
> cases, for example when there are payloads and the text is something like A|2.0 
> good|1.0 day|3.0, or when the result of a query is processed by something like 
> Carrot2.
> So this is a simple new kind of field that takes as input the output of a given 
> type (source), and then performs the normal processing with the desired tokenizers 
> and filters. The difference is that the stored value is the output of the source 
> type, and this is what is retrieved when getting the document.
> The name of the field type  is AnalyzedField and in the schema is introduced 
> in the following way to create the analyzedSourceType from the  SourceType
> (The schema.xml snippets were stripped from this message; what survives shows two 
> analyzer definitions using solr.StandardTokenizerFactory and a field type declared 
> with positionIncrementGap="100" preProcessType="SourceType".)
> many times just the WhitespaceTokenizerFactory is needed, as the tokens have 
> already been cut down by the SourceType.
> Finally, a field can be declared of this type (the field element was also stripped; 
> its surviving attributes are stored="true" termVectors="true" multiValued="true"), 
> either written directly or defined as a copy of the source one.
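
A sketch of what such declarations might look like (only the attribute fragments that 
survived in the text above come from the original; every other name here is an 
assumption):

{code}
<fieldType name="analyzedSourceType" class="solr.AnalyzedField"
           positionIncrementGap="100" preProcessType="SourceType">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="myField" type="analyzedSourceType" indexed="true"
       stored="true" termVectors="true" multiValued="true"/>
{code}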

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Problem loading jcc from java : undefined symbol: PyExc_IOError

2011-02-14 Thread Roman Chyla
Hello Andi, all,

The python embedded in Java works really well on MacOsX and also
Ubuntu. But I am trying hard to make it work also on Scientific Linux
(SLC5) with *statically* built Python. The python is a build from
ActiveState.

So far, I managed to build all the needed extensions (jcc, lucene,
solr) and I can run them in python, but when I try to start the java
app and use python, I get:

SEVERE: org.apache.jcc.PythonException:
/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/lib-dynload/time.so:
undefined symbol: PyExc_IOError


I understand that the missing symbol PyExc_IOError is in the static
python library:

bash-3.2$ nm 
/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config/libpython2.5.a
| grep IOError
4120 D PyExc_IOError
4140 d _PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError

So when building JCC, I build with these arguments:

lflags  +  ['-lpython%s.%s' %(sys.version_info[0:2]),
'-L',
'/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config',
'-rdynamic',
'-Wl,--export-dynamic',
'-Xlinker',
'--export-dynamic']

I just found instructions at:
http://stackoverflow.com/questions/4223312/python-interpreter-embedded-in-the-application-fails-to-load-native-modules
I don't really understand g++, but the symbol is there after the compilation

bash-3.2$ nm 
/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg/libjcc.so
| grep IOError
00352240 D PyExc_IOError
00352260 d _PyExc_IOError

And when starting java, I do
"-Djava.library.path=/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg"

The code works fine on mac (python 2.6) and ubuntu (python 2.6), but
not this statically linked python2.5 - would you know what I can try?

Thanks.


  roman


PS: I tried several compilations, but I was usually re-compiling JCC
without building lucene etc again, I hope that is not the problem.


[jira] Commented: (SOLR-1711) Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java

2011-02-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994562#comment-12994562
 ] 

Yonik Seeley commented on SOLR-1711:


bq. So there might be a race condition on the queue capacity check.

Yeah.  What about moving the queue.put() inside the synchronized(runners) block 
to fix this?
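
A rough sketch of that suggestion (field and method names follow 
StreamingUpdateSolrServer, but treat the details as assumptions rather than a 
reviewed patch):

{code:java}
synchronized (runners) {
  queue.put(req);                 // enqueue while holding the runners lock, so the
                                  // capacity/emptiness check below can't race with a worker exiting
  if (runners.isEmpty()
      || (queue.remainingCapacity() < queue.size() && runners.size() < threadCount)) {
    Runner r = new Runner();
    scheduler.execute(r);         // guarantee at least one worker is alive to drain the queue
    runners.add(r);
  }
}
{code}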

> Race condition in 
> org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java
> --
>
> Key: SOLR-1711
> URL: https://issues.apache.org/jira/browse/SOLR-1711
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java
>Affects Versions: 1.4, 1.5
>Reporter: Attila Babo
>Assignee: Yonik Seeley
>Priority: Critical
> Fix For: 1.4.1, 1.5, 3.1, 4.0
>
> Attachments: StreamingUpdateSolrServer.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> While inserting a large pile of documents using StreamingUpdateSolrServer 
> there is a race condition as all Runner instances stop processing while the 
> blocking queue is full. With a high performance client this could happen 
> quite often, there is no way to recover from it at the client side.
> In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
> UpdateRequests, and up to threadCount worker threads from 
> StreamingUpdateSolrServer.Runner read that queue and push requests to a 
> Solr instance. If at some point the BlockingQueue is empty, all workers stop 
> processing it and push the collected content to Solr, which can be a 
> time-consuming process; sometimes all worker threads are waiting for Solr. If at 
> this time the client fills the BlockingQueue completely, all worker threads will 
> quit without processing any further and the main thread will block forever.
> There is a simple, well tested patch attached to handle this situation.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Solr-dev mailing list on Nabble

2011-02-14 Thread Steven A Rowe
As I mentioned on the solr-dev mailing list 
, 
David Smiley's responses to emails on dev@l.a.o have been going to 
solr-dev@l.a.o.  This is a problem, and it's not restricted to David's emails.

David Smiley's response to me in a private email, in part: "When I reply via 
Nabble, it uses the old list which is why my emails have been sent to the wrong 
list."

I put up a support request on Nabble: 
http://nabble-support.1.n2.nabble.com/solr-dev-mailing-list-td6023495.html and 
the only response so far seems to indicate that mailing lists are managed by 
admins associated with the project with which each mailing list is associated.

Who is the admin for the Solr and/or Lucene mailing lists on Nabble?

Steve


Re: wind down for 3.1?

2011-02-14 Thread Grant Ingersoll

On Feb 12, 2011, at 7:38 PM, David Smiley (@MITRE.org) wrote:
> 
> One that comes to mind (and to several others I know) is SOLR-1709
> Distributed date faceting. This has had working code for a long time, though
> admittedly not a proper patch nor tests.  That issue sorely needs to get
> committed IMO.

I can tell you that I often stop reviewing a patch as soon as I notice it 
doesn't have tests. In fact, I wish we could get the Hadoop Hudson auto-test 
stuff hooked in so that it would -1 patches that don't have tests.

So, if it sorely needs to be committed, then it sorely needs tests written.  


>  And then, it may not qualify as a "bug", but a release is an
> opportunity to tidy up the /browse interface.  I tried to use it recently in
> 3x and it felt half-baked.

Trunk is much further along for /browse.  It probably could stand to be 
backported.

-Grant

[jira] Resolved: (SOLR-2342) Lock starvation can cause commit to never run when many clients are adding docs

2011-02-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved SOLR-2342.
--

   Resolution: Fixed
Fix Version/s: 4.0
 Assignee: Michael McCandless

> Lock starvation can cause commit to never run when many clients are adding 
> docs
> ---
>
> Key: SOLR-2342
> URL: https://issues.apache.org/jira/browse/SOLR-2342
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1, 4.0
>
>
> I have a stress test, where 100 clients add 100 1MB docs and then call commit 
> in the end.  It's a falldown test (try to make Solr fall down) and nowhere 
> near "actual" usage.
> But, after some initial commits that succeed, I'm seeing later commits always 
> time out (client side timeout @ 10 minutes).  Watching Solr's logging, no 
> commit ever runs.
> Looking at the stack traces in the threads, this is not deadlock: the 
> add/update calls are running, and new segments are being flushed to the index.
> Digging in the code a bit, we use ReentrantReadWriteLock, with add/update 
> acquiring the readLock and commit acquiring the writeLock.  But, according to 
> the jdocs, the writeLock isn't given any priority over the readLock (unless 
> you set fairness, which we don't).  So I think this explains the starvation?
> However, this is not a real world use case (most apps would/should call 
> commit less often, and from one client).  Also, we could set fairness, but it 
> seems to have some performance penalty, and I'm not sure we should penalize 
> the "normal" case for this unusual one.  EG see here (thanks Mark): 
> http://www.javaspecialists.eu/archive/Issue165.html.
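
For reference, the fairness option discussed above is just the boolean constructor 
argument of the JDK lock; an illustration (not Solr's code):

{code:java}
// A fair lock hands the write lock to the waiting committer before admitting newly
// arriving readers, at some throughput cost.
ReentrantReadWriteLock iwLock = new ReentrantReadWriteLock(true);  // true = fair ordering

void addDoc(SolrInputDocument doc) {
  iwLock.readLock().lock();
  try {
    // add/update the document
  } finally {
    iwLock.readLock().unlock();
  }
}

void commit() {
  iwLock.writeLock().lock();
  try {
    // commit the index
  } finally {
    iwLock.writeLock().unlock();
  }
}
{code}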

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2342) Lock starvation can cause commit to never run when many clients are adding docs

2011-02-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994544#comment-12994544
 ] 

Michael McCandless commented on SOLR-2342:
--

Thanks Mark; I'll commit shortly...

> Lock starvation can cause commit to never run when many clients are adding 
> docs
> ---
>
> Key: SOLR-2342
> URL: https://issues.apache.org/jira/browse/SOLR-2342
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
>
> I have a stress test, where 100 clients add 100 1MB docs and then call commit 
> in the end.  It's a falldown test (try to make Solr fall down) and nowhere 
> near "actual" usage.
> But, after some initial commits that succeed, I'm seeing later commits always 
> time out (client side timeout @ 10 minutes).  Watching Solr's logging, no 
> commit ever runs.
> Looking at the stack traces in the threads, this is not deadlock: the 
> add/update calls are running, and new segments are being flushed to the index.
> Digging in the code a bit, we use ReentrantReadWriteLock, with add/update 
> acquiring the readLock and commit acquiring the writeLock.  But, according to 
> the jdocs, the writeLock isn't given any priority over the readLock (unless 
> you set fairness, which we don't).  So I think this explains the starvation?
> However, this is not a real world use case (most apps would/should call 
> commit less often, and from one client).  Also, we could set fairness, but it 
> seems to have some performance penalty, and I'm not sure we should penalize 
> the "normal" case for this unusual one.  EG see here (thanks Mark): 
> http://www.javaspecialists.eu/archive/Issue165.html.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2918) IndexWriter should prune 100% deleted segs even in the NRT case

2011-02-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2918:
---

Attachment: LUCENE-2918.patch

Patch.

> IndexWriter should prune 100% deleted segs even in the NRT case
> ---
>
> Key: LUCENE-2918
> URL: https://issues.apache.org/jira/browse/LUCENE-2918
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-2918.patch
>
>
> We now prune 100% deleted segs on commit from IW or IR (LUCENE-2010),
> but this isn't quite aggressive enough, because in the NRT case you
> rarely call commit.
> Instead, the moment we delete the last doc of a segment, it should be
> pruned from the in-memory segmentInfos.  This way, if you open an NRT
> reader, or a merge kicks off, or commit is called, the 100% deleted
> segment is already gone.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2918) IndexWriter should prune 100% deleted segs even in the NRT case

2011-02-14 Thread Michael McCandless (JIRA)
IndexWriter should prune 100% deleted segs even in the NRT case
---

 Key: LUCENE-2918
 URL: https://issues.apache.org/jira/browse/LUCENE-2918
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0


We now prune 100% deleted segs on commit from IW or IR (LUCENE-2010),
but this isn't quite aggressive enough, because in the NRT case you
rarely call commit.

Instead, the moment we delete the last doc of a segment, it should be
pruned from the in-memory segmentInfos.  This way, if you open an NRT
reader, or a merge kicks off, or commit is called, the 100% deleted
segment is already gone.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1553) extended dismax query parser

2011-02-14 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-1553:


Attachment: edismax.unescapedcolon.bug.test.patch

+1 to mark as experimental in 3.1



The innards of how this works are totally Greek to me, but I tried finding something 
to fix hoss' unescaped-colon patch.  It seems that the root of the problem is that 
QueryParserBase.parse( String ) will return a BooleanQuery with no clauses for 
the invalid field query.
{code:java}
Query res = TopLevelQuery(field);
return res!=null ? res : newBooleanQuery(false);
{code}
Then the edismax just checks if the parsedQuery is null to see if it is valid.

I tried just returning null from the QueryParserBase, but that (not 
surprisingly) breaks other tests like TestMultiFieldQueryParser.  I imagine 
something changed here that explains why it used to work and now "mysteriously" 
does not.  

Adding a check for empty BooleanQuery fixes this in edismax though:
{code:java}
if( parsedUserQuery instanceof BooleanQuery ) {
  if( ((BooleanQuery)parsedUserQuery).getClauses().length < 1 ) {
    parsedUserQuery = null;
  }
}
{code}

All tests pass... but can someone who knows what the ramifications of this 
change means take a look?

> extended dismax query parser
> 
>
> Key: SOLR-1553
> URL: https://issues.apache.org/jira/browse/SOLR-1553
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 1.5, 3.1, 4.0
>
> Attachments: SOLR-1553.patch, SOLR-1553.pf-refactor.patch, 
> edismax.unescapedcolon.bug.test.patch, edismax.unescapedcolon.bug.test.patch, 
> edismax.userFields.patch
>
>
> An improved user-facing query parser based on dismax

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2917) callers should be able to advance()/jump() without fear

2011-02-14 Thread Robert Muir (JIRA)
callers should be able to advance()/jump() without fear
---

 Key: LUCENE-2917
 URL: https://issues.apache.org/jira/browse/LUCENE-2917
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Fix For: Bulk Postings branch


Currently, in various places in the code (TermScorer, ExactPhraseScorer) there 
are optimizations 
that assume advance/jump is heavy, and for short doc-distances etc, they next() 
their way instead.

This sort of logic should instead be in the codec: jump/advance should always 
be fast.
It's the codec's responsibility to make this happen: jump/advance need not 
involve using skip data.

For example: in the fixed layout from LUCENE-2905, various forms of 
block-skipping can take place
to do this operation without skip data (this is implemented in its docs and 
docsAndPositionsEnums,
but not yet its bulk postings enums).

For block codecs, they should always avoid trying to skip if the target is 
likely within-block,
and if the target is likely only a few blocks away, it can still be faster not 
to skip, as skipping
out of block requires several fills. In the fixed layout we can do these sort 
of 'fast scans' where
in the docsenum case, we keep the freqs buffer one step behind the docs buffer, 
skipping it when
we pass over it, and only filling freqs a single time at the end... in the 
docsandpositions case
we can do the exact same thing with positions.

I think as part of this we should tighten the API for the bulkpostings jump: 
it should require the current doc (the old enums knew this implicitly) to allow 
for different jump impls. For positions, I think it's at least fair to require 
the caller to pass in the pending positions count.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Distributed Indexing

2011-02-14 Thread Yonik Seeley
On Mon, Feb 14, 2011 at 10:04 AM, Alex Cowell  wrote:
> There seem to be some nuances which we have yet to encounter/discover like
> the way you've implemented the processCommit() method to wait for all the
> adds/deletes to complete before sending the commits. Are these things which
> you were aware of in advance that would need to be dealt with?

Yeah, it really just has to do with the fact that I was using multiple
threads to send update commands to other nodes.
This means that if you do an add and then a delete of the same doc,
those could get reordered to do the delete first, and then the add.
And the commit at the end could sneak in front of some adds and
deletes still in progress on other threads.

For true distributed indexing, I think we'll want a version number
somehow (perhaps based on timestamp by default) so updates can be
ordered, and all nodes can agree on the ordering.  For example, one
client could update node A with doc X, and a different client could
update node B with doc X.  If that happens very close together, we
need all shard replicas to agree which doc X will win.
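
A toy illustration of that "newest version wins" reconciliation (plain Java, not
Solr code; how versions are assigned and distributed is the open question):

// Each replica remembers the highest version it has applied per document id and
// ignores anything older, so replicas converge regardless of arrival order.
private final Map<String, Long> appliedVersions = new HashMap<String, Long>();

public synchronized boolean shouldApply(String docId, long version) {
  Long current = appliedVersions.get(docId);
  if (current != null && current >= version) {
    return false;              // stale or duplicate update; drop it
  }
  appliedVersions.put(docId, version);
  return true;
}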

-Yonik
http://lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2905) Sep codec writes insane amounts of skip data

2011-02-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994370#comment-12994370
 ] 

Robert Muir commented on LUCENE-2905:
-

OK I committed this for now to the branch (r1070580), we can always revert it.

I cut over FOR, PFOR, and PFOR2 to use this, and all tests pass with all these 
codecs.


> Sep codec writes insane amounts of skip data
> 
>
> Key: LUCENE-2905
> URL: https://issues.apache.org/jira/browse/LUCENE-2905
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Robert Muir
> Fix For: Bulk Postings branch
>
> Attachments: LUCENE-2905_intblock.patch, 
> LUCENE-2905_interleaved.patch, LUCENE-2905_simple64.patch, 
> LUCENE-2905_skipIntervalMin.patch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable 
> Intblock
> encodings, we have problems with both performance and index size versus 
> StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset 
> into the block.
> * In a lot of cases various numbers involved are larger than they should be: 
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try 
> to fix this:
> Is Sep really necessary? Instead should we make an alternative to Sep, 
> Interleaved? that interleaves doc and freq blocks (doc,freq,doc,freq) into 
> one file? the concrete impl could implement skipBlock() for when they only 
> want docdeltas: e.g. for Simple64 blocks on disk are fixed size so it could 
> just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint just read 
> their single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such 
> as 64 or 128, so a skipInterval of 16 seems a little overkill.
> Shouldn't skipInterval not even be a final constant in SegmentWriteState, but 
> instead completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to 
> the start of a block? Then, their skip pointers dont need to be a combination 
> of delta + upto, because upto is always zero. What would we have to modify in 
> the bulkpostings api for jump() to work with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor, based 
> on block size (maybe by default its 1, meaning we can skip to the start of a 
> every block)
> For codecs like Simple64 that encode fixed length frames, shouldnt we use 
> 'blockid' instead of file pointer so that we get smaller numbers? e.g. 
> simple64 can do blockid * 8 to get to the file pointer.
> Going along with the blockid concept, couldnt pointers in the terms dict be 
> blockid deltas from the index term, instead of fp deltas? This would be 
> smaller numbers and we could compress this metadata better.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-02-14 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994331#comment-12994331
 ] 

Steven Rowe commented on LUCENE-2611:
-

bq. I couldn't compile most of the non-english languages when I updated my 3x 
branch recently. 

David,

IntelliJ has a project-wide encoding setting: Settings - Project Settings | 
File Encoding (see http://jetbrains.dzone.com/articles/new-approach-encoding 
for details).  Mine is set to UTF-8.  Is yours set to something different?  I 
suspect that IntelliJ automatically uses this setting when invoking {{javac}}, 
and that's why I never had this problem.

> IntelliJ IDEA and Eclipse setup
> ---
>
> Key: LUCENE-2611
> URL: https://issues.apache.org/jira/browse/LUCENE-2611
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2611-branch-3x-part2.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
> LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
> LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
> LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
> LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
> LUCENE-2611_test_2.patch, utf8.patch
>
>
> Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
> The attached patches add a new top level directory {{dev-tools/}} with 
> sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
> as well as top-level ant targets named "idea" and "eclipse" that copy these 
> files into the proper locations.  This arrangement avoids the messiness 
> attendant to in-place project configuration files directly checked into 
> source control.
> The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
> Solr contrib, and each analysis module.  A JUnit run configuration per module 
> is included.
> The Eclipse configuration includes a source entry for each 
> source/test/resource location and classpath setup: a library entry for each 
> jar.
> For IDEA, once {{ant idea}} has been run, the only configuration that must be 
> performed manually is configuring the project-level JDK.  For Eclipse, once 
> {{ant eclipse}} has been run, the user has to refresh the project 
> (right-click on the project and choose Refresh).
> If these patches are committed, Subversion svn:ignore properties should be 
> added/modified to ignore the destination IDEA and Eclipse configuration 
> locations.
> Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
> for applying the 3.X branch patch for IDEA: 
> http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2358) Distributing Indexing

2011-02-14 Thread Alex Cowell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Cowell updated SOLR-2358:
--

Attachment: SOLR-2358.patch

Added a patch which handles distributed add and commit update requests.

Please see the 'Distributed Indexing' thread on the dev mailing list for more 
info.

> Distributing Indexing
> -
>
> Key: SOLR-2358
> URL: https://issues.apache.org/jira/browse/SOLR-2358
> Project: Solr
>  Issue Type: New Feature
>Reporter: William Mayor
>Priority: Minor
> Attachments: SOLR-2358.patch
>
>
> The first steps towards creating distributed indexing functionality in Solr

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Distributed Indexing

2011-02-14 Thread Alex Cowell
I've uploaded a patch of what we've done so far:

https://issues.apache.org/jira/browse/SOLR-2358

It's still very much a work in progress and there are some obvious issues
being resolved at the moment (such as the inefficient approach of waiting
for all the docs to be processed before distributing them in one batch, and
handling shard replicas), but any feedback is welcome.

As it stands, you can distribute add and commit requests using the
HashedDistributionPolicy by simply specifying a 'shards' request parameter.
Using a user-specified distribution policy (either as a param in the URL or
defined in the solrconfig as Upayavira suggested) is in the works as well.
Regarding that, I figure the priority for determining which policy to use
would be (highest to lowest):

1. Param in the URL
2. Specified in the solrconfig
3. Hard-coded default to fall back on

That way if a user changed their mind about which distribution policy they
wanted to use, they could override the default policy with their chosen one
as a request parameter.
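
For illustration, a minimal sketch of that fallback order; the request parameter
name, the way the solrconfig value is passed in, and the fully qualified default
class name are assumptions for the example rather than anything the patch defines:

import org.apache.solr.common.params.SolrParams;

// Minimal sketch of the proposed resolution order. The parameter name
// "distrib.policy", the solrconfig value arriving as a plain String, and the
// default class name below are illustrative assumptions, not the patch's API.
public class DistributionPolicyResolver {

  public static String resolvePolicyClassName(SolrParams requestParams,
                                               String solrconfigPolicyClassName) {
    // 1. Param in the URL wins if present
    String fromRequest = requestParams.get("distrib.policy");
    if (fromRequest != null) {
      return fromRequest;
    }
    // 2. Otherwise use whatever solrconfig.xml declares, if anything
    if (solrconfigPolicyClassName != null) {
      return solrconfigPolicyClassName;
    }
    // 3. Hard-coded default to fall back on
    return "org.apache.solr.update.HashedDistributionPolicy";
  }
}

That way the request parameter always overrides the configured policy, exactly
as described above.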

The code has only been acceptance tested at the moment. There is a test
class but it's a bit messy, so once that's tidied up and improved a little
more I'll include it in the next patch.


> I haven't had time to follow all of this discussion, but this issue might
> help:
> https://issues.apache.org/jira/browse/SOLR-2355
>

Thanks - very interesting! It's reassuring to see our implementation has
been following a similar structure.

There seem to be some nuances which we have yet to encounter/discover, like
the way you've implemented the processCommit() method to wait for all the
adds/deletes to complete before sending the commits. Were these things you
were aware of in advance that would need to be dealt with?

Alex


[jira] Issue Comment Edited: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-02-14 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994323#comment-12994323
 ] 

Steven Rowe edited comment on LUCENE-2611 at 2/14/11 2:59 PM:
--

*edit*: escaped the {{\*}} metacharacter so that it appears as itself rather 
than switching to bold format.

bq. For some reason it was missed that javac should be invoked with "-encoding 
utf-8" options.

Thanks for bringing this up - I hadn't realized that javac defaulted to the 
platform encoding.  I've committed this part of your patch to both trunk and 
branch_3x.

bq. I also included a build.xml tweak that overwrites the .iml files

I didn't commit this part of your patch, because I don't think it's a good idea 
to only overwrite the {{\*.iml}} files, and not also the {{.idea/\*.xml}} files, 
since the two have to be in sync, e.g. when there are structural changes.  

Right now there is a {{clean-idea}} task that can be used to serve this 
function.  The {{.idea}} directory is where IntelliJ stores stuff like shelved 
changes, and it would be really bad to automatically delete that stuff as part 
of updating IntelliJ configuration, so that directory is never automatically 
overwritten.

One technical note about this part of your patch: 
{code:xml}

  

+   
  
  
+ 
+   
+ 
+   
+ 
{code}

{{\*.iml}} does not refer to all {{.iml}} files in all sub-directories, but 
rather only those in the top-level directory.  You want {{\*\*/\*.iml}} to 
catch all of them recursively.


  was (Author: steve_rowe):
bq. For some reason it was missed that javac should be invoked with 
"-encoding utf-8" options.

Thanks for bringing this up - I hadn't realized that javac defaulted to the 
platform encoding.  I've committed this part of your patch to both trunk and 
branch_3x.

bq. I also included a build.xml tweak that overwrites the .iml files

I didn't commit this part of your patch, because I don't think it's a good idea 
to only overwrite the {{*.iml}} files, and not also the {{.idea/*.xml}} files, 
since the two have to be in sync, e.g. when there are structural changes.  

Right now there is a {{clean-idea}} task that can be used to serve this 
function.  The {{.idea}} directory is where IntelliJ stores stuff like shelved 
changes, and it would be really bad to automatically delete that stuff as part 
of updating IntelliJ configuration, so that directory is never automatically 
overwritten.

One technical note about this part of your patch: 
{code:xml}

  

+   
  
  
+ 
+   
+ 
+   
+ 
{code}

{{*.iml}} does not refer to all {{.iml}} files in all sub-directories, but 
rather only those in the top-level directory.  You want {{**/*.iml}} to catch 
all of them recursively.

  
> IntelliJ IDEA and Eclipse setup
> ---
>
> Key: LUCENE-2611
> URL: https://issues.apache.org/jira/browse/LUCENE-2611
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2611-branch-3x-part2.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
> LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
> LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
> LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
> LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
> LUCENE-2611_test_2.patch, utf8.patch
>
>
> Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
> The attached patches add a new top level directory {{dev-tools/}} with 
> sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
> as well as top-level ant targets named "idea" and "eclipse" that copy these 
> files into the proper locations.  This arrangement avoids the messiness 
> attendant to in-place project configuration files directly checked into 
> source control.
> The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
> Solr contrib, and each analysis module.  A JUnit run configuration per module 
> is included.
> The Eclipse configuration includes a source entry for each 
> source/test/resource location and classpath setup: a library entry for each 
> jar.
> For IDEA, once {{ant idea}} has been run, the only configuration that must be 
> performed manually is configuring the project-level JDK.  For Eclipse, once 
> {{ant eclipse}} has been run, the user has to refresh the project 
> (right-click on the project and choose Refresh).
> If these patches are committed, Subversion svn:ignore properties should be 
> added/modified to ignore the destination IDEA and Eclipse configuration 
> locations.
> Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
> for applying the 3.X branch patch for IDEA: 
> http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ

[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-02-14 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994323#comment-12994323
 ] 

Steven Rowe commented on LUCENE-2611:
-

bq. For some reason it was missed that javac should be invoked with "-encoding 
utf-8" options.

Thanks for bringing this up - I hadn't realized that javac defaulted to the 
platform encoding.  I've committed this part of your patch to both trunk and 
branch_3x.

bq. I also included a build.xml tweak that overwrites the .iml files

I didn't commit this part of your patch, because I don't think it's a good idea 
to only overwrite the {{*.iml}} files, and not also the {{.idea/*.xml}} files, 
since the two have to be in sync, e.g. when there are structural changes.  

Right now there is a {{clean-idea}} task that can be used to serve this 
function.  The {{.idea}} directory is where IntelliJ stores stuff like shelved 
changes, and it would be really bad to automatically delete that stuff as part 
of updating IntelliJ configuration, so that directory is never automatically 
overwritten.

One technical note about this part of your patch: 
{code:xml}

  

+   
  
  
+ 
+   
+ 
+   
+ 
{code}

{{*.iml}} does not refer to all {{.iml}} files in all sub-directories, but 
rather only those in the top-level directory.  You want {{**/*.iml}} to catch 
all of them recursively.
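
As a quick illustration of the difference, here is a small standalone sketch that 
uses java.nio's glob matcher as an analogue of Ant's include patterns (one caveat, 
noted in the comments: Ant additionally lets {{**/}} match zero directories, so 
{{**/*.iml}} also picks up top-level files):

{code:java}
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Minimal sketch using java.nio glob matching as an analogue for Ant's
// include patterns. One difference: Ant's "**/" also matches zero
// directories, so in Ant "**/*.iml" picks up top-level .iml files as well.
public class GlobDemo {
  public static void main(String[] args) {
    PathMatcher flat      = FileSystems.getDefault().getPathMatcher("glob:*.iml");
    PathMatcher recursive = FileSystems.getDefault().getPathMatcher("glob:**/*.iml");

    Path topLevel = Paths.get("lucene.iml");
    Path nested   = Paths.get("solr/contrib/clustering/clustering.iml");

    System.out.println(flat.matches(topLevel));    // true  - '*' stays in one directory
    System.out.println(flat.matches(nested));      // false - nested files are not matched
    System.out.println(recursive.matches(nested)); // true  - '**' crosses directories
  }
}
{code}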


> IntelliJ IDEA and Eclipse setup
> ---
>
> Key: LUCENE-2611
> URL: https://issues.apache.org/jira/browse/LUCENE-2611
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2611-branch-3x-part2.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
> LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
> LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
> LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
> LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
> LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
> LUCENE-2611_test_2.patch, utf8.patch
>
>
> Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
> The attached patches add a new top level directory {{dev-tools/}} with 
> sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
> as well as top-level ant targets named "idea" and "eclipse" that copy these 
> files into the proper locations.  This arrangement avoids the messiness 
> attendant to in-place project configuration files directly checked into 
> source control.
> The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
> Solr contrib, and each analysis module.  A JUnit run configuration per module 
> is included.
> The Eclipse configuration includes a source entry for each 
> source/test/resource location and classpath setup: a library entry for each 
> jar.
> For IDEA, once {{ant idea}} has been run, the only configuration that must be 
> performed manually is configuring the project-level JDK.  For Eclipse, once 
> {{ant eclipse}} has been run, the user has to refresh the project 
> (right-click on the project and choose Refresh).
> If these patches are committed, Subversion svn:ignore properties should be 
> added/modified to ignore the destination IDEA and Eclipse configuration 
> locations.
> Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
> for applying the 3.X branch patch for IDEA: 
> http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [HUDSON] Lucene-Solr-tests-only-trunk - Build # 4884 - Failure

2011-02-14 Thread Robert Muir
I committed a fix for this.

On Mon, Feb 14, 2011 at 9:07 AM, Apache Hudson Server
 wrote:
> Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4884/
>
> 1 tests failed.
> REGRESSION:  org.apache.lucene.search.TestMultiTermConstantScore.testBoost
>
> Error Message:
> expected:<1> but was:<2>
>
> Stack Trace:
> junit.framework.AssertionFailedError: expected:<1> but was:<2>
>        at 
> org.apache.lucene.search.TestMultiTermConstantScore.testBoost(TestMultiTermConstantScore.java:223)
>        at 
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
>        at 
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
>
>
>
>
> Build Log (for compile errors):
> [...truncated 3055 lines...]
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2905) Sep codec writes insane amounts of skip data

2011-02-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994304#comment-12994304
 ] 

Robert Muir commented on LUCENE-2905:
-

Seems Hudson just hit the "long-tail random fail" I was talking about, in trunk.

So I don't think it's a problem with this codec: 
https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4884/


> Sep codec writes insane amounts of skip data
> 
>
> Key: LUCENE-2905
> URL: https://issues.apache.org/jira/browse/LUCENE-2905
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Robert Muir
> Fix For: Bulk Postings branch
>
> Attachments: LUCENE-2905_intblock.patch, 
> LUCENE-2905_interleaved.patch, LUCENE-2905_simple64.patch, 
> LUCENE-2905_skipIntervalMin.patch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable 
> Intblock
> encodings, we have problems with both performance and index size versus 
> StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset 
> into the block.
> * In a lot of cases various numbers involved are larger than they should be: 
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) for things we could do to try 
> to fix this:
> Is Sep really necessary? Instead, should we make an alternative to Sep, 
> Interleaved, that interleaves doc and freq blocks (doc,freq,doc,freq) into 
> one file? The concrete impl could implement skipBlock() for when callers only 
> want doc deltas: e.g. for Simple64, blocks on disk are fixed size, so it could 
> just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint just read 
> the single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such 
> as 64 or 128, so a skipInterval of 16 seems like overkill.
> Shouldn't skipInterval not be a final constant in SegmentWriteState at all, 
> but instead be completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to 
> the start of a block? Then their skip pointers don't need to be a combination 
> of delta + upto, because upto is always zero. What would we have to modify in 
> the bulkpostings API for jump() to work with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor based 
> on block size (maybe by default it's 1, meaning we can skip to the start of 
> every block)?
> For codecs like Simple64 that encode fixed-length frames, shouldn't we use a 
> 'blockid' instead of a file pointer so that we get smaller numbers? E.g. 
> Simple64 can do blockid * 8 to get to the file pointer.
> Going along with the blockid concept, couldn't pointers in the terms dict be 
> blockid deltas from the index term, instead of fp deltas? These would be 
> smaller numbers, and we could compress this metadata better.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4884 - Failure

2011-02-14 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4884/

1 tests failed.
REGRESSION:  org.apache.lucene.search.TestMultiTermConstantScore.testBoost

Error Message:
expected:<1> but was:<2>

Stack Trace:
junit.framework.AssertionFailedError: expected:<1> but was:<2>
at 
org.apache.lucene.search.TestMultiTermConstantScore.testBoost(TestMultiTermConstantScore.java:223)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)




Build Log (for compile errors):
[...truncated 3055 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2334) solr. icu4j for Unicode Normalization

2011-02-14 Thread ahmad maher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994269#comment-12994269
 ] 

ahmad maher edited comment on SOLR-2334 at 2/14/11 11:48 AM:
-

Can you give an example - how can I use it in the Solr schema file?

  was (Author: amd_maher):
can you give an example ?
  
> solr. icu4j for Unicode Normalization
> -
>
> Key: SOLR-2334
> URL: https://issues.apache.org/jira/browse/SOLR-2334
> Project: Solr
>  Issue Type: Test
>  Components: clients - java
>Affects Versions: 1.4
> Environment: debian lenny and squeez  , 1386 arch
>Reporter: ahmad maher
> Fix For: 1.4.2
>
>
> Dears,
> i use icu4j for UnicodeNormalization in schema.xml like that
> "
>  version="icu4j" composed="false" remove_diacritics="true" 
> remove_modifiers="true" fold="true"/>
> "
> and if i use any token except English tokens in filter class ,  it return 
> error,  like in using solr.PatternReplaceFilterFactory
> how can i use :
> transliterate rule  and transform rule in solr schema or config file ?
> as mentioned here http://userguide.icu-project.org/transforms/general
> can any one help me ?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2334) solr. icu4j for Unicode Normalization

2011-02-14 Thread ahmad maher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ahmad maher updated SOLR-2334:
--

Comment: was deleted

(was: thank you for replay,

how can use Pattern Replace  OR char map or replace in my solr schema file
using the ICU4j for non English patterns or chars ?





)

> solr. icu4j for Unicode Normalization
> -
>
> Key: SOLR-2334
> URL: https://issues.apache.org/jira/browse/SOLR-2334
> Project: Solr
>  Issue Type: Test
>  Components: clients - java
>Affects Versions: 1.4
> Environment: debian lenny and squeez  , 1386 arch
>Reporter: ahmad maher
> Fix For: 1.4.2
>
>
> Dears,
> i use icu4j for UnicodeNormalization in schema.xml like that
> "
>  version="icu4j" composed="false" remove_diacritics="true" 
> remove_modifiers="true" fold="true"/>
> "
> and if i use any token except English tokens in filter class ,  it return 
> error,  like in using solr.PatternReplaceFilterFactory
> how can i use :
> transliterate rule  and transform rule in solr schema or config file ?
> as mentioned here http://userguide.icu-project.org/transforms/general
> can any one help me ?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2334) solr. icu4j for Unicode Normalization

2011-02-14 Thread ahmad maher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994269#comment-12994269
 ] 

ahmad maher commented on SOLR-2334:
---

Can you give an example?

> solr. icu4j for Unicode Normalization
> -
>
> Key: SOLR-2334
> URL: https://issues.apache.org/jira/browse/SOLR-2334
> Project: Solr
>  Issue Type: Test
>  Components: clients - java
>Affects Versions: 1.4
> Environment: debian lenny and squeez  , 1386 arch
>Reporter: ahmad maher
> Fix For: 1.4.2
>
>
> Dears,
> i use icu4j for UnicodeNormalization in schema.xml like that
> "
>  version="icu4j" composed="false" remove_diacritics="true" 
> remove_modifiers="true" fold="true"/>
> "
> and if i use any token except English tokens in filter class ,  it return 
> error,  like in using solr.PatternReplaceFilterFactory
> how can i use :
> transliterate rule  and transform rule in solr schema or config file ?
> as mentioned here http://userguide.icu-project.org/transforms/general
> can any one help me ?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-14 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994256#comment-12994256
 ] 

Bill Bell commented on SOLR-2155:
-

I did more research. You cannot get from a doc to multiple values in the cache 
for a field; from what I can see, it does not exist. The "docToTermOrd" property 
(type Direct8) is an array indexed by the document ID, and it holds one 
value (the term ord). It does not appear to be easy to get a list, since there 
is only one value. This was created to easily count the number of documents for 
facets (does it have 1 or more). I could do something like the following (but 
it would be really slow).

Document doc = searcher.doc(id, fields);

It would be better if you copied each lat/lon into the index with a prefix 
added to the sfield, like "store_1", "store_2", "store_3", when you index the 
values. Then I can grab them easily. Of course you could also just store them in 
one field like I did, but name it store_1: "lat,lon|lat,lon". If we did 
this during indexing it would make it easier for people to use (not having to 
copy it) with bars. Asking for 2, 3, or 4 term lists by document ID is probably 
slower than just doing the "|" separation.
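
To make the "lat,lon|lat,lon" idea concrete, here is a minimal standalone sketch 
that parses such a stored value and brute-forces the distance from a query point 
to each stored point; the field name store_1 and the pipe format follow the 
suggestion above, and nothing here is an existing Solr API:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: parse a "lat,lon|lat,lon" stored value and compute the
// haversine distance from a query point to each stored point. The field
// format is the one suggested above; this is not an existing Solr API.
public class PipeDelimitedPoints {

  static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    double r = 6371.0; // mean earth radius in km
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * r * Math.asin(Math.sqrt(a));
  }

  static List<double[]> parse(String storedValue) {
    List<double[]> points = new ArrayList<double[]>();
    // the split() calls here are what improvement #4 below wants to avoid
    for (String pair : storedValue.split("\\|")) {
      String[] latLon = pair.split(",");
      points.add(new double[] {Double.parseDouble(latLon[0]),
                               Double.parseDouble(latLon[1])});
    }
    return points;
  }

  public static void main(String[] args) {
    String store1 = "39.7392,-104.9903|40.0150,-105.2705"; // Denver | Boulder
    for (double[] p : parse(store1)) {
      System.out.printf("%.4f km from query point%n",
          haversineKm(39.7392, -104.9903, p[0], p[1]));
    }
  }
}
{code}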

I keep going back to my patch, and I think it is still pretty good. I hope 
others have not gone down this same path, since it was not fun.

Potential improvements:

1. Auto-populate sfieldmulti with "|" when indexing the geohash field
2. Multi-thread the brute-force search over the lat/lons
3. Use DistanceUtils for hsin
4. Remove split() to improve performance

Bill



> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch, SOLR.2155.p2.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (i.e. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1395) Integrate Katta

2011-02-14 Thread tom liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994255#comment-12994255
 ] 

tom liu commented on SOLR-1395:
---

The ISolrServer is handled by the Katta node and is configured by:
# solrconfig.xml: used by ISolrServer's default SolrCore
# the katta script: used to tell ISolrServer where its Solr home is.

Katta script (on a katta node, but not on the katta master):
{noformat}
KATTA_OPTS="$KATTA_OPTS -Dsolr.home=/var/data/solr 
-Dsolr.directoryFactory=solr.MMapDirectoryFactory"
{noformat}

When the Katta node starts up, ISolrServer picks up solr.home and 
solr.directoryFactory, and ISolrServer's default SolrCore then uses those 
properties to hold the SolrCore.
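
For illustration, here is a minimal sketch of what the node-side startup amounts 
to if it only has to read those two system properties; the class name and 
defaults are made up, only the property names come from the snippet above:

{code:java}
// Minimal sketch: read the settings passed to the Katta node via KATTA_OPTS.
// Only the property names (solr.home, solr.directoryFactory) come from the
// snippet above; the class name and default values are illustrative.
public class KattaNodeSolrSettings {
  public static void main(String[] args) {
    String solrHome = System.getProperty("solr.home", "/var/data/solr");
    String directoryFactory =
        System.getProperty("solr.directoryFactory", "solr.StandardDirectoryFactory");

    // The default SolrCore would then be created against this Solr home,
    // using the configured DirectoryFactory implementation.
    System.out.println("solr.home = " + solrHome);
    System.out.println("solr.directoryFactory = " + directoryFactory);
  }
}
{code}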

> Integrate Katta
> ---
>
> Key: SOLR-1395
> URL: https://issues.apache.org/jira/browse/SOLR-1395
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: Next
>
> Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
> back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
> katta-solrcores.jpg, katta.node.properties, katta.zk.properties, 
> log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
> solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
> solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
> solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
> solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, 
> zkclient-0.1-dev.jar, zookeeper-3.2.1.jar
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We'll integrate Katta into Solr so that:
> * Distributed search uses Hadoop RPC
> * Shard/SolrCore distribution and management
> * Zookeeper based failover
> * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2334) solr. icu4j for Unicode Normalization

2011-02-14 Thread ahmad maher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992418#comment-12992418
 ] 

ahmad maher edited comment on SOLR-2334 at 2/14/11 9:32 AM:


Thank you for the reply.

How can I use Pattern Replace OR char map or replace in my Solr schema file
using ICU4j for non-English patterns or chars?







  was (Author: amd_maher):
thank you for replay,

how can use Pattern Replace  OR char map or replace in my solr schema file
using the ICU4j for non English patterns or chars ?
OR 
adding Arabic Normalization






  
> solr. icu4j for Unicode Normalization
> -
>
> Key: SOLR-2334
> URL: https://issues.apache.org/jira/browse/SOLR-2334
> Project: Solr
>  Issue Type: Test
>  Components: clients - java
>Affects Versions: 1.4
> Environment: debian lenny and squeez  , 1386 arch
>Reporter: ahmad maher
> Fix For: 1.4.2
>
>
> Dears,
> i use icu4j for UnicodeNormalization in schema.xml like that
> "
>  version="icu4j" composed="false" remove_diacritics="true" 
> remove_modifiers="true" fold="true"/>
> "
> and if i use any token except English tokens in filter class ,  it return 
> error,  like in using solr.PatternReplaceFilterFactory
> how can i use :
> transliterate rule  and transform rule in solr schema or config file ?
> as mentioned here http://userguide.icu-project.org/transforms/general
> can any one help me ?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2334) solr. icu4j for Unicode Normalization

2011-02-14 Thread ahmad maher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992418#comment-12992418
 ] 

ahmad maher edited comment on SOLR-2334 at 2/14/11 9:22 AM:


Thank you for the reply.

How can I use Pattern Replace OR char map or replace in my Solr schema file
using ICU4j for non-English patterns or chars?
OR
adding Arabic Normalization?







  was (Author: amd_maher):
thank you for replay,

how can use Pattern Replace  OR char map or replace in my solr schema file
using the ICU4j for non English patterns or chars ?




  
> solr. icu4j for Unicode Normalization
> -
>
> Key: SOLR-2334
> URL: https://issues.apache.org/jira/browse/SOLR-2334
> Project: Solr
>  Issue Type: Test
>  Components: clients - java
>Affects Versions: 1.4
> Environment: debian lenny and squeez  , 1386 arch
>Reporter: ahmad maher
> Fix For: 1.4.2
>
>
> Dears,
> i use icu4j for UnicodeNormalization in schema.xml like that
> "
>  version="icu4j" composed="false" remove_diacritics="true" 
> remove_modifiers="true" fold="true"/>
> "
> and if i use any token except English tokens in filter class ,  it return 
> error,  like in using solr.PatternReplaceFilterFactory
> how can i use :
> transliterate rule  and transform rule in solr schema or config file ?
> as mentioned here http://userguide.icu-project.org/transforms/general
> can any one help me ?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1395) Integrate Katta

2011-02-14 Thread JohnWu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994224#comment-12994224
 ] 

JohnWu commented on SOLR-1395:
--

TomLiu:
   how do I configure the ISolrServer so that it dispatches requests to a SolrCore? Only 
in solrconfig.xml (searchHandler) and the properties (DeployableSOlrKattaServer)?

JohnWu

> Integrate Katta
> ---
>
> Key: SOLR-1395
> URL: https://issues.apache.org/jira/browse/SOLR-1395
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: Next
>
> Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
> back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
> katta-solrcores.jpg, katta.node.properties, katta.zk.properties, 
> log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
> solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
> solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
> solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
> solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, 
> zkclient-0.1-dev.jar, zookeeper-3.2.1.jar
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We'll integrate Katta into Solr so that:
> * Distributed search uses Hadoop RPC
> * Shard/SolrCore distribution and management
> * Zookeeper based failover
> * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org