[jira] Issue Comment Edited: (LUCENE-1567) New flexible query parser

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720379#action_12720379
 ] 

Grant Ingersoll edited comment on LUCENE-1567 at 6/16/09 3:16 PM:
--

I need an MD5/SHA1 hash 
(http://incubator.apache.org/ip-clearance/ip-clearance-template.html) for the 
exact code listed in the software grant.  Also include the version number of 
the software used to create the hash. 

Please also upload that code as a tarball on this issue.  No need to worry 
about the patches for now. 

See https://issues.apache.org/jira/browse/INCUBATOR-77 for an example.

  was (Author: gsingers):
I need an MD5/SHA1 hash 
(http://incubator.apache.org/ip-clearance/ip-clearance-template.html) for the 
exact code listed in the software grant.  Also include the version number of 
the software used to create the hash.  

See https://issues.apache.org/jira/browse/INCUBATOR-77 for example.
  
 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch, 
 QueryParser_restructure_meetup_june2009_v2.pdf


 From the New flexible query parser thread by Michael Busch:
 In my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is an Apache committer and PMC member
 on the Tuscany project and getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was to separate the syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
    AND
   /   \
  a     b
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. It is in fact a
 configurable chain of processors. Each processor can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
 Furthermore the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow attaching resource bundles. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in a few days. We now have a 100%
 compatible Lucene query parser, which means the syntax is identical and
 all query parser test cases pass on the new one too using a wrapper.
 Recent posts show that there is demand for query syntax improvements,
 e.g. improved range query syntax or operator precedence. There are
 already different QP implementations in Lucene+contrib; however, I think
 we did not keep them all up to date and in sync. This is not too
 surprising, because usually when fixes and changes are made to the main
 query parser, people don't make the corresponding changes in the contrib
 parsers. (I'm guilty here too.)
 With this new architecture it will be much easier to maintain different
 query syntaxes, as the actual code for the first layer is not very much.
 All syntaxes would benefit from patches and improvements we make to the
 underlying layers, which will make supporting different syntaxes much
 more manageable.
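
As an illustration of the three-layer flow described above, here is a minimal, hypothetical Java sketch; the interface and class names are assumptions, not the actual contributed API:

{code}
// Hypothetical sketch of the three layers; real names in the patch may differ.
interface QueryNode { }                       // node in the QueryNodeTree

interface SyntaxParser {                      // layer 1: query text -> tree
  QueryNode parse(String queryText);
}

interface QueryNodeProcessor {                // layer 2: tree -> (rewritten) tree
  QueryNode process(QueryNode tree);
}

interface QueryBuilder {                      // layer 3: tree -> Lucene Query
  org.apache.lucene.search.Query build(QueryNode tree);
}

class FlexibleQueryParserSketch {
  private final SyntaxParser parser;
  private final java.util.List<QueryNodeProcessor> processors;
  private final QueryBuilder builder;

  FlexibleQueryParserSketch(SyntaxParser parser,
                            java.util.List<QueryNodeProcessor> processors,
                            QueryBuilder builder) {
    this.parser = parser;
    this.processors = processors;
    this.builder = builder;
  }

  org.apache.lucene.search.Query parse(String queryText) {
    QueryNode tree = parser.parse(queryText);   // layer 1: syntax only
    for (QueryNodeProcessor p : processors) {   // layer 2: configurable chain
      tree = p.process(tree);
    }
    return builder.build(tree);                 // layer 3: build Lucene Query
  }
}
{code}

A new syntax then only requires a new layer-1 parser; the processor chain and builders are reused unchanged.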

[jira] Commented: (LUCENE-1567) New flexible query parser

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720386#action_12720386
 ] 

Grant Ingersoll commented on LUCENE-1567:
-

OK, the only outstanding items for clearance are:
1. tarball and hash
2. Vote on Incubator for clearance.





[jira] Commented: (LUCENE-1567) New flexible query parser

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720417#action_12720417
 ] 

Grant Ingersoll commented on LUCENE-1567:
-

Commit is separate from IP Clearance and you can't commit until the clearance 
is accepted.

 I just need the tarball for the code that was referenced in the software grant, 
along with a hash of it.  In the grant, you have a file directory listing 
describing the code.  Take that file listing, tar it up and run md5 on it.
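
For illustration, a minimal Java sketch of computing the MD5 hex digest of the tarball; the file name is a placeholder, and running an md5/md5sum command-line tool on the file gives the same result:

{code}
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class TarballHash {
  public static void main(String[] args) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    // args[0] is the tarball, e.g. flex-queryparser.tar.gz (placeholder name)
    InputStream in = new FileInputStream(args[0]);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md5.update(buf, 0, n);     // feed the file through the digest
      }
    } finally {
      in.close();
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex);       // the hash to record on the clearance page
  }
}
{code}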




[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720037#action_12720037
 ] 

Grant Ingersoll commented on MAHOUT-65:
---

Hey Jeff,

Minor request: it seems like you have some sort of reformatting going on that 
causes the patch to contain all kinds of formatting changes, which makes it a 
lot harder to see the actual changes.

In thinking about this a little bit more, is there a way to just name a vector 
and a row in a Matrix?  All I really want right now is to be able to track 
which Vector is associated with which document, and I could do this by setting 
a unique name on the Vector and having that serialized.  The name itself could 
be stored in the first entry (for SparseVector, it would have to coincide with 
the sCardinality stuff).

I'm fine with all the other label stuff, too.

Also, the patch doesn't apply because of the JSONVectorAdapter.

 Add Element Labels to Vectors and Matrices
 --

 Key: MAHOUT-65
 URL: https://issues.apache.org/jira/browse/MAHOUT-65
 Project: Mahout
  Issue Type: New Feature
  Components: Matrix
Affects Versions: 0.1
Reporter: Jeff Eastman
Assignee: Jeff Eastman
 Attachments: MAHOUT-65.patch, MAHOUT-65b.patch, MAHOUT-65c.patch


 Many applications can benefit by accessing elements in vectors and matrices 
 using String labels in addition to numeric indices. Investigate adding such a 
 capability.




[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720045#action_12720045
 ] 

Grant Ingersoll commented on MAHOUT-65:
---

Is the only way to add bindings by setting the map?  Seems like
{code}
set(String label, int index, double value) 
{code}
would be useful.  Also, if bindings is null, it could create a new map.

Also, I'll see if I can work up the name thing.
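
A minimal sketch of what such a method might look like, assuming a lazily created label-to-index map; the names here are illustrative, not the actual Mahout API:

{code}
import java.util.HashMap;
import java.util.Map;

public abstract class LabeledVectorSketch {
  private Map<String, Integer> bindings;  // label -> index, created lazily

  protected abstract void setQuick(int index, double value);
  protected abstract double getQuick(int index);

  // Bind the label and set the value in one call, creating the map on
  // demand so unlabeled vectors pay no overhead.
  public void set(String label, int index, double value) {
    if (bindings == null) {
      bindings = new HashMap<String, Integer>();
    }
    bindings.put(label, index);
    setQuick(index, value);
  }

  public double get(String label) {
    Integer index = (bindings == null) ? null : bindings.get(label);
    if (index == null) {
      throw new IllegalArgumentException("no binding for label: " + label);
    }
    return getQuick(index.intValue());
  }
}
{code}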





[jira] Resolved: (MAHOUT-134) [PATCH] Cluster decode error handling

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-134.


   Resolution: Fixed
Fix Version/s: 0.2

Committed revision 785197.

 [PATCH] Cluster decode error handling
 -

 Key: MAHOUT-134
 URL: https://issues.apache.org/jira/browse/MAHOUT-134
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.2
Reporter: Robert Burrell Donkin
 Fix For: 0.2

 Attachments: mahout-cluster-format-error.patch


 ATM the javadocs are unclear as to whether null is an acceptable return value, 
 and callers do not null-check the return value. However, the implementation 
 may return null or throw runtime exceptions when the format is not correct. 
 This makes it hard to diagnose when there's a problem with the format.




[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720161#action_12720161
 ] 

Grant Ingersoll commented on MAHOUT-65:
---

OK, I will work up a patch for the name thing on a vector, unless you think 
that can be handled through the bindings thing.  Basically, I think we need a 
way to name a vector and have it carried through.




[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720199#action_12720199
 ] 

Grant Ingersoll commented on MAHOUT-65:
---

That works for Matrix.  For Vector, I was thinking, probably naively, we simply 
need to be able to add a name attribute.  For MAHOUT-130, I just dumped the 
column/cell labels out separately.  Like you said, I'm not sure we want all of 
that serialized. 




[jira] Updated: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-65:
--

Attachment: MAHOUT-65-name.patch

Adds a name attribute.  Also adds some docs on equals, plus a strict 
equivalence notion that can be useful if one cares about the implementation.






[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720302#action_12720302
 ] 

Grant Ingersoll commented on MAHOUT-65:
---

Jeff,

One comment on the GSON serialization stuff.  It can get pretty verbose storing 
the class name repeatedly, although I do realize it's a drop in the bucket 
compared to the vector itself.  Perhaps we could do like Solr does: if some 
abbreviated form is present where a class name is required (maybe 'DV' or 
'SV'), it could know to use those forms; otherwise it can do the full class 
lookup.  Might just save a little bit on the size of a serialized file, which 
I imagine can add up.
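
A hypothetical sketch of that abbreviation idea; the 'DV'/'SV' keys and the class names are assumptions, not anything in the patch:

{code}
import java.util.HashMap;
import java.util.Map;

public class VectorTypeResolver {
  private static final Map<String, String> ABBREVIATIONS =
      new HashMap<String, String>();
  static {
    // assumed short keys and class names, for illustration only
    ABBREVIATIONS.put("DV", "org.apache.mahout.matrix.DenseVector");
    ABBREVIATIONS.put("SV", "org.apache.mahout.matrix.SparseVector");
  }

  // Resolve an abbreviated type key if present, else fall back to treating
  // the key as a fully qualified class name.
  public static Class<?> resolve(String typeKey) throws ClassNotFoundException {
    String className = ABBREVIATIONS.get(typeKey);
    return Class.forName(className != null ? className : typeKey);
  }
}
{code}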




[jira] Updated: (MAHOUT-131) Vector improvements

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-131:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Vector improvements
 ---

 Key: MAHOUT-131
 URL: https://issues.apache.org/jira/browse/MAHOUT-131
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.2

 Attachments: MAHOUT-131.patch, MAHOUT-131.patch


 Vector and its implementations could use a few things:
 1. DenseVector should implement equals and hashCode similar to SparseVector
 2. The VectorView asFormatString() is not compatible with actually recreating 
 any type of vector.  
 3. Add tests to VectorTest that assert that decodeFormat/asFormatString is 
 able to do a round trip.
 4. Add static AbstractVector.equivalent(Vector, Vector) that takes in two 
 vectors and compares them for equality, regardless of their implementation.




[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-126:
---

Fix Version/s: 0.2
Affects Version/s: 0.2
   Status: Patch Available  (was: Open)

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken into two tasks. 
 1. Create a Lucene index of the input plain-text documents. 
 2. From the index, generate the (sparse) document vectors with weights set to 
 the TF-IDF values of the terms. With the Lucene index, these values can be 
 calculated very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class.
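
A rough Java sketch of step 2, assuming the field was indexed with term vectors enabled; this is illustrative, not the patch's actual utility:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

public class TfIdfSketch {
  // Compute TF-IDF weights for one document's terms from its term vector.
  public static double[] weights(IndexReader reader, int docId, String field)
      throws IOException {
    TermFreqVector tfv = reader.getTermFreqVector(docId, field);
    if (tfv == null) {
      throw new IllegalStateException("field was not indexed with term vectors");
    }
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    int numDocs = reader.numDocs();
    double[] weights = new double[terms.length];
    for (int i = 0; i < terms.length; i++) {
      int df = reader.docFreq(new Term(field, terms[i]));        // doc frequency
      double idf = Math.log(numDocs / (double) (df + 1)) + 1.0;  // smoothed idf
      weights[i] = freqs[i] * idf;                               // tf * idf
    }
    return weights;
  }
}
{code}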




[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-126:
---

Attachment: MAHOUT-126.patch

Here's a version that is brought up to trunk and adds in MAHOUT-65-name.patch 
to allow for labeling the vectors.

Next, I'm going to run the output through some clustering.




[jira] Updated: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-65:
--

Attachment: MAHOUT-65-name.patch

Implement hashCode better; require equals and hashCode as part of the 
interface, same as java.util.List.




[jira] Updated: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-65:
--

Attachment: MAHOUT-65-name.patch

How about a version where the tests actually pass?  Will commit shortly.




[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2009-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720345#action_12720345
 ] 

Grant Ingersoll commented on MAHOUT-65:
---

Committed the name stuff: 
Committed revision 785386.




[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-15 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719524#action_12719524
 ] 

Grant Ingersoll commented on MAHOUT-126:


Yeah, still needs the labeling stuff.

As for weights, you should be able to pass in a Weight object.  See the TFIDF 
implementation.  Likely still needs some work.

As for the Lucene error, I thought I had updated the Lucene version to be 
2.9-dev, which I believe makes this all right.




[jira] Assigned: (MAHOUT-132) [PATCH] Push magic names into public constants

2009-06-14 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-132:
--

Assignee: Grant Ingersoll

 [PATCH] Push magic names into public constants
 --

 Key: MAHOUT-132
 URL: https://issues.apache.org/jira/browse/MAHOUT-132
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.2
Reporter: Robert Burrell Donkin
Assignee: Grant Ingersoll
 Attachments: mahout-constants.patch


 ATM the examples (and any similar code) need to hard-code magic strings for 
 directories. This makes the code more fragile and more difficult to 
 understand.




[jira] Resolved: (MAHOUT-132) [PATCH] Push magic names into public constants

2009-06-14 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-132.


   Resolution: Fixed
Fix Version/s: 0.2

Committed revision 784640.




[jira] Updated: (MAHOUT-131) Vector improvements

2009-06-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-131:
---

Attachment: MAHOUT-131.patch

Updated patch implements equals/hashCode for all the Vectors and puts in 
various tests related to these issues.  It should now be the case that 
Vector.equals() acts just like List.equals(), namely that two vectors 
containing the same elements are equal regardless of the implementation.
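
A minimal sketch of that List-style contract, with assumed method names: two vectors are equal iff they have the same cardinality and the same value at every index, whatever the concrete class:

{code}
public abstract class AbstractVectorSketch {
  protected abstract int cardinality();
  protected abstract double getQuick(int index);

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof AbstractVectorSketch)) return false;
    AbstractVectorSketch that = (AbstractVectorSketch) o;
    if (cardinality() != that.cardinality()) return false;
    for (int i = 0; i < cardinality(); i++) {
      // element-wise comparison, independent of dense/sparse representation
      if (getQuick(i) != that.getQuick(i)) return false;
    }
    return true;
  }

  @Override
  public int hashCode() {
    // like java.util.List.hashCode(): order-dependent combination of elements
    int hash = 1;
    for (int i = 0; i < cardinality(); i++) {
      long bits = Double.doubleToLongBits(getQuick(i));
      hash = 31 * hash + (int) (bits ^ (bits >>> 32));
    }
    return hash;
  }
}
{code}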




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719117#action_12719117
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Would it be useful to take a shot at rewriting SparseVector to use this?

You could do that, or an alternate implementation.  Is there any case where one 
wouldn't want this?  Also, I wouldn't mind a little better name than 
FastIntDouble.  ;-)

 Speed up distance calculations for sparse vectors
 -

 Key: MAHOUT-121
 URL: https://issues.apache.org/jira/browse/MAHOUT-121
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Shashikant Kore
 Attachments: mahout-121.patch


 From my mail to the Mahout mailing list.
 I am working on clustering a dataset which has thousands of sparse vectors. 
 The complete dataset has a few tens of thousands of feature items, but each 
 vector has only a couple of hundred feature items. For this, there is an 
 optimization in distance calculation, a link to which I found in the archives 
 of the Mahout mailing list.
 http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
 I tried out this optimization.  The test setup had 2000 document vectors 
 with a few hundred items each.  I ran canopy generation with Euclidean 
 distance and t1, t2 values of 250 and 200.
  
 Current Canopy Generation: 28 min 15 sec.
 Canopy Generation with distance optimization: 1 min 38 sec.
 I know from experience that using Integer and Double objects instead of 
 primitives is computationally expensive. I changed the sparse vector 
 implementation to use the primitive collections from Trove [
 http://trove4j.sourceforge.net/ ].
 Distance optimization with Trove: 59 sec
 Current canopy generation with Trove: 21 min 55 sec
 To sum up, these two optimizations reduced cluster generation time by 97%.
 Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
  
 Licensing of Trove seems to be an issue which needs to be addressed.
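
A minimal sketch of the distance trick from the linked post, using plain java.util maps rather than Trove: expand ||x - y||^2 into ||x||^2 - 2*x.y + ||y||^2, cache the squared norms once, and only iterate non-zero entries for the dot product. Names here are illustrative, not the actual Mahout classes:

{code}
import java.util.Map;

public class SparseEuclideanSketch {
  // squared L2 norm, iterating only the non-zero entries
  public static double normSquared(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return sum;
  }

  public static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    // iterate the smaller map and probe the larger one
    if (a.size() > b.size()) {
      Map<Integer, Double> t = a; a = b; b = t;
    }
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double y = b.get(e.getKey());
      if (y != null) {
        sum += e.getValue() * y;
      }
    }
    return sum;
  }

  // distance using cached squared norms: no scan over the union of dimensions
  public static double distance(Map<Integer, Double> a, double normA2,
                                Map<Integer, Double> b, double normB2) {
    // clamp at zero to guard against tiny negative values from rounding
    return Math.sqrt(Math.max(0.0, normA2 - 2.0 * dot(a, b) + normB2));
  }
}
{code}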




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719118#action_12719118
 ] 

Grant Ingersoll commented on MAHOUT-121:


Also, seems like we could split out the original two issues that Shashikant 
brought up, right?




[jira] Created: (LUCENE-1687) Remove ExtendedFieldCache by rolling functionality into FieldCache

2009-06-12 Thread Grant Ingersoll (JIRA)
Remove ExtendedFieldCache by rolling functionality into FieldCache
--

 Key: LUCENE-1687
 URL: https://issues.apache.org/jira/browse/LUCENE-1687
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9


It is silly that we have ExtendedFieldCache.  It is a workaround for our 
supposed back-compatibility problem.  This patch will merge the 
ExtendedFieldCache interface into FieldCache, thereby breaking back 
compatibility but creating a much simpler API for FieldCache.




[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718815#action_12718815
 ] 

Grant Ingersoll commented on LUCENE-1676:
-

OK, I moved to contrib/CHANGES.  I'm going to commit this today.


 New Token filter for adding payloads in-stream
 

 Key: LUCENE-1676
 URL: https://issues.apache.org/jira/browse/LUCENE-1676
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1676.patch


 This TokenFilter is able to split a token based on a delimiter and use one 
 part as the token and the other part as a payload.  This allows someone to 
 include payloads inline with tokens (presumably set up by a pipeline ahead of 
 time).  An example is apropos.  Given a | delimiter, we could have a stream 
 that looks like:
 {quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ 
 dogs|NN{quote}
 In this case, this would produce tokens and payloads (assuming whitespace 
 tokenization):
 Token: the
 Payload: null
 Token: quick
 Payload: JJ
 Token: red
 Payload: JJ
 and so on.
 This patch will also support pluggable encoders for the payloads, so it can 
 convert from the character array to byte arrays as appropriate.
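
A hedged sketch of the splitting idea, using the older Token-based TokenStream API; the actual filter in the attached patch may differ:

{code}
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

class DelimitedPayloadFilterSketch extends TokenFilter {
  private final char delimiter;

  DelimitedPayloadFilterSketch(TokenStream input, char delimiter) {
    super(input);
    this.delimiter = delimiter;
  }

  @Override
  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) return null;
    char[] buf = token.termBuffer();
    int len = token.termLength();
    for (int i = 0; i < len; i++) {
      if (buf[i] == delimiter) {
        // everything after the delimiter becomes the payload; this sketch
        // goes through a String (a real encoder would be pluggable)
        String payload = new String(buf, i + 1, len - i - 1);
        token.setPayload(new Payload(payload.getBytes()));
        // ...and the token is truncated to the part before the delimiter
        token.setTermLength(i);
        break;
      }
    }
    return token;
  }
}
{code}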




[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718817#action_12718817
 ] 

Grant Ingersoll commented on LUCENE-1676:
-

BTW, I'm curious if people have a better way to convert from char[] to byte[] 
for encoding the payloads (see FloatEncoder), other than going through Strings.
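
One possibility, sketched here as a suggestion rather than anything in the patch: encode a char[] slice with java.nio, avoiding the intermediate String:

{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsToBytes {
  // Encode a char[] slice to byte[] via a Charset, with no String created.
  static byte[] encode(char[] chars, int offset, int length) {
    ByteBuffer bytes = Charset.forName("UTF-8")
        .encode(CharBuffer.wrap(chars, offset, length));
    byte[] out = new byte[bytes.remaining()];
    bytes.get(out);
    return out;
  }
}
{code}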




[jira] Commented: (LUCENE-1687) Remove ExtendedFieldCache by rolling functionality into FieldCache

2009-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718845#action_12718845
 ] 

Grant Ingersoll commented on LUCENE-1687:
-

True, but you know how we are about adding methods to an interface!

 Remove ExtendedFieldCache by rolling functionality into FieldCache
 --

 Key: LUCENE-1687
 URL: https://issues.apache.org/jira/browse/LUCENE-1687
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9


 It is silly that we have ExtendedFieldCache.  It is a workaround to our 
 supposed back compatibility problem.  This patch will merge the 
 ExtendedFieldCache interface into FieldCache, thereby breaking back 
 compatibility, but creating a much simpler API for FieldCache.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718943#action_12718943
 ] 

Grant Ingersoll commented on LUCENE-1676:
-

I grabbed Apache Harmony's Integer.parseInt() code and converted it to take in 
a char array, which should speed up the IntegerEncoder.  However, the 
Float.parseFloat implementation relies on some constructs that are not 
available in JDK 1.4, so that one is going to have to stay as it is.

The main problem lies in the reliance on HexStringParser 
(https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/org/apache/harmony/luni/util/HexStringParser.java), 
which needs some Long-specific attributes that are either post-JDK 1.4 or 
Harmony-specific (I didn't take the time to investigate).

At any rate, I added the Integer stuff to ArrayUtils and also added some tests.

For reference, see: 
https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/org/apache/harmony/luni/util/FloatingPointParser.java

https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/java/lang/Integer.java
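
For flavor, a minimal sketch (not the Harmony code) of parsing an int straight from a char[] slice, which is the kind of String-free parsing described above; overflow handling is omitted:

{code}
class CharArrayInt {
  static int parseInt(char[] chars, int offset, int length) {
    if (length == 0) throw new NumberFormatException("empty input");
    int i = offset;
    int end = offset + length;
    boolean negative = chars[i] == '-';
    if (negative) i++;
    int result = 0;
    for (; i < end; i++) {
      int digit = chars[i] - '0';
      if (digit < 0 || digit > 9) {
        throw new NumberFormatException(new String(chars, offset, length));
      }
      result = result * 10 + digit;  // overflow check omitted in this sketch
    }
    return negative ? -result : result;
  }
}
{code}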






[jira] Resolved: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-12 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-1676.
-

   Resolution: Fixed
Lucene Fields:   (was: [New])

Committed revision 784297.




[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718447#action_12718447
 ] 

Grant Ingersoll commented on LUCENE-1676:
-

bq. Shouldn't the CHANGES entry in this patch go into contrib/CHANGES?

It can, I've never quite been sure.  I think more people read the top-level 
CHANGES, thus it is more likely to be noticed, but I'm fine either way.




[jira] Commented: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark

2009-06-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718450#action_12718450
 ] 

Grant Ingersoll commented on LUCENE-979:


bq. What are the old benchmark utilities? 

It's basically one class from before Doron's task-oriented approach.  I 
believe it's called Benchmark.java, and it was only able to do a few 
benchmarking tasks.

 Remove Deprecated Benchmarking Utilities from contrib/benchmark
 ---

 Key: LUCENE-979
 URL: https://issues.apache.org/jira/browse/LUCENE-979
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Grant Ingersoll
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 The old Benchmark utilities in contrib/benchmark have been deprecated and 
 should be removed in 2.9 of Lucene.




[jira] Commented: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark

2009-06-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718463#action_12718463
 ] 

Grant Ingersoll commented on LUCENE-979:


Yes.




[jira] Updated: (MAHOUT-131) Vector improvements

2009-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-131:
---

Attachment: MAHOUT-131.patch

Some minor changes to Vector, etc.




[jira] Commented: (SOLR-1209) Site search powered by Lucene/Solr

2009-06-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718507#action_12718507
 ] 

Grant Ingersoll commented on SOLR-1209:
---

bq. Just a small doubt - I assume Google shares revenue generated from clicks 
on the site search page with ASF. Are we sure this is not affecting ASF 
money-wise?

They don't share the revenue.  All the Google box is right now is a Forrest 
auto-generated default plugin.


 Site search powered by Lucene/Solr
 --

 Key: SOLR-1209
 URL: https://issues.apache.org/jira/browse/SOLR-1209
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1209.patch


 For a number of years now, the Lucene community has been criticized for not 
 eating our own dog food when it comes to search. My company has built and 
 hosts a site search (http://www.lucidimagination.com/search) that is powered 
 by Apache Solr and Lucene and we'd like to donate its use to the Lucene 
 community. Additionally, it allows one to search all of the Solr content from 
 a single place, including web, wiki, JIRA and mail archives. See also 
 http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
 A preview of the site (for Mahout) is available at 
 http://people.apache.org/~gsingers/mahout/site/publish/
 Lucid has a fault tolerant setup with replication and fail over as well as 
 monitoring services in place. We are committed to maintaining and expanding 
 the search capabilities on the site.
 The following patch adds a skin to the Forrest site that enables the Solr 
 site to search Solr only content using Lucene/Solr. When a search is 
 submitted, it automatically selects the Solr facet such that only Solr 
 content is searched. From there, users can then narrow/broaden their search 
 criteria.
 I'm submitting this patch to Solr first, as we'd like to roll out our 
 capabilities to some of the smaller communities first and then broaden to the 
 rest of the Lucene ecosystem.
 I plan on committing in 3 or 4 days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-131) Vector improvements

2009-06-10 Thread Grant Ingersoll (JIRA)
Vector improvements
---

 Key: MAHOUT-131
 URL: https://issues.apache.org/jira/browse/MAHOUT-131
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.2


Vector and its implementations could use a few things:

1. DenseVector should implement equals and hashCode similar to SparseVector
2. The VectorView asFormatString() is not compatible with actually recreating 
any type of vector.  
3. Add tests to VectorTest that assert that decodeFormat/asFormatString is able 
to do a round trip.
4. Add static AbstractVector.equivalent(Vector, Vector) that takes in two 
vectors and compares them for equality, regardless of their implementation.
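
As a rough illustration of item 4, here is a minimal sketch of an 
implementation-agnostic equality check.  This is not the actual patch; the 
accessor names size() and get(int) are assumptions for illustration only.

{code}
// Hypothetical sketch: compare two vectors element by element, regardless
// of whether they are dense or sparse. size() and get(int) are assumed
// accessor names, not necessarily Mahout's actual Vector API.
public static boolean equivalent(Vector left, Vector right) {
  if (left == right) {
    return true;
  }
  if (left.size() != right.size()) {
    return false;
  }
  for (int i = 0; i < left.size(); i++) {
    if (left.get(i) != right.get(i)) {
      return false;
    }
  }
  return true;
}
{code}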

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717819#action_12717819
 ] 

Grant Ingersoll commented on LUCENE-1678:
-

I frankly don't like renaming something like this.  This is, once again, a case 
of back compatibility biting us.  If, instead of working around back compat, we 
had just made Analyzer.tokenStream be reusable, we wouldn't have to do this.  
Now, instead, we are going to have a convoluted name for something (reusableTS).

In my mind, it's better to just make tokenStream do the right thing and get rid 
of reusableTokenStream.
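
To make the breakage concrete, here is a hedged sketch of the kind of subclass 
the linked thread describes.  After upgrading, indexing invokes 
reusableTokenStream, which StandardAnalyzer implements without consulting 
tokenStream, so an override like this is silently bypassed:

{code}
import java.io.Reader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Illustrative only: a pre-2.9-style subclass that customizes analysis by
// overriding tokenStream(). Because indexing now calls reusableTokenStream(),
// this override is never invoked.
public class MyAnalyzer extends StandardAnalyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(super.tokenStream(fieldName, reader));
  }
}
{code}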

 Deprecate Analyzer.tokenStream
 --

 Key: LUCENE-1678
 URL: https://issues.apache.org/jira/browse/LUCENE-1678
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 The addition of reusableTokenStream to the core analyzers unfortunately broke 
 back compat of external subclasses:
 
 http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
 On upgrading, such subclasses would silently not be used anymore, since 
 Lucene's indexing invokes reusableTokenStream.
 I think we should at least deprecate Analyzer.tokenStream, today, so 
 that users see deprecation warnings if their classes override this method.  
 But going forward, when we want to change the API of core classes that are 
 extended, I think we have to introduce entirely new classes to keep back 
 compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717888#action_12717888
 ] 

Grant Ingersoll commented on LUCENE-1678:
-

bq. If there are sane/smart ways to change our back compat policy, I think you 
have seen that no one would object.

The sane/smart way is to do it on a case-by-case basis.  Here is a specific 
case.  Generalizing it a bit, the places where it should be more easily 
relaxed are the cases where we know very few people make customizations, as 
in implementing Fieldable or FieldCache.

As for this specific case, the original change was the thing that broke back 
compat.  So, given it is already broken, why not fix it the right way?

 Deprecate Analyzer.tokenStream
 --

 Key: LUCENE-1678
 URL: https://issues.apache.org/jira/browse/LUCENE-1678
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 The addition of reusableTokenStream to the core analyzers unfortunately broke 
 back compat of external subclasses:
 
 http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
 On upgrading, such subclasses would silently not be used anymore, since 
 Lucene's indexing invokes reusableTokenStream.
 I think we should at least deprecate Analyzer.tokenStream, today, so 
 that users see deprecation warnings if their classes override this method.  
 But going forward, when we want to change the API of core classes that are 
 extended, I think we have to introduce entirely new classes to keep back 
 compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm

2009-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-130.


Resolution: Fixed

Committed Ted's patch

 Vector should allow for other normalize powers than the L-2 norm
 

 Key: MAHOUT-130
 URL: https://issues.apache.org/jira/browse/MAHOUT-130
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-130-both-ways.patch, 
 MAHOUT-130-slight-tweaks.patch, MAHOUT-130.patch, MAHOUT-130.patch


 Modify Vector to allow other normalize functions for the Vector
 See 
 http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-126:
---

Attachment: MAHOUT-126.patch

Here's a first attempt at my thoughts based on the two previous patches, plus 
some other ideas.

The main gist of the idea centers around the VectorIterable interface and is 
driven by the o.a.mahout.utils.vectors.Driver class.

Note, I dropped the Lucene indexing part, as I don't think we need to be in the 
game of creating Lucene indexes.  That is a well-known and well-documented 
process that is available elsewhere.  In fact, for this particular piece, I 
indexed Wikipedia in Solr and then pointed the Driver class at the Lucene index.

See 
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text 
for details on usage.

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: mahout-126-benson.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken into two tasks. 
 1. Create a Lucene index of the input plain-text documents. 
 2. From the index, generate the (sparse) document vectors with weights as 
 TF-IDF values of the terms. With a Lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-108) Implementation of Association Rules learning by Apriori algorithm

2009-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717245#action_12717245
 ] 

Grant Ingersoll commented on MAHOUT-108:


http://cwiki.apache.org/MAHOUT/howtocontribute.html

 Implementation of Association Rules learning by Apriori algorithm
 -

 Key: MAHOUT-108
 URL: https://issues.apache.org/jira/browse/MAHOUT-108
 Project: Mahout
  Issue Type: Task
 Environment: Linux, Hadoop-0.17.1
Reporter: chao deng
 Fix For: 0.2

   Original Estimate: 504h
  Remaining Estimate: 504h

 Target: Association Rules learning is a popular method for discovering 
 interesting relations between variables in large databases. Here, we would 
 implement the Apriori algorithm using Hadoop MapReduce parallel techniques.
 Applications: Typically, association rules learning is used to discover 
 regularities between products in large-scale transaction data in 
 supermarkets. For example, the rule {onions, potatoes} -> beef found in the 
 sales data would indicate that if a customer buys onions and potatoes 
 together, he or she is likely to also buy beef. Such information can be used 
 as the basis for decisions about marketing activities. In addition to 
 market basket analysis, association rules are employed today in many 
 application areas including Web usage mining, intrusion detection and 
 bioinformatics.
 Apriori algorithm: Apriori is the best-known algorithm for mining association 
 rules. It uses a breadth-first search strategy to count the support of 
 itemsets and uses a candidate generation function which exploits the downward 
 closure property of support.
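
For intuition, here is a minimal single-machine sketch of Apriori's level-wise 
loop; frequentSingletons, joinAndPrune, and countSupport are hypothetical 
helpers, and the proposed MapReduce version would distribute the support 
counting step:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of Apriori's level-wise search; the helper methods
// named below are illustrative, not existing Mahout code.
List<Set<Set<String>>> apriori(List<Set<String>> transactions, int minSupport) {
  List<Set<Set<String>>> levels = new ArrayList<Set<Set<String>>>();
  Set<Set<String>> frequent = frequentSingletons(transactions, minSupport);
  while (!frequent.isEmpty()) {
    levels.add(frequent);
    // Downward closure: every subset of a frequent itemset is frequent, so
    // candidates at level k+1 are built only from frequent level-k itemsets.
    Set<Set<String>> candidates = joinAndPrune(frequent);
    frequent = countSupport(candidates, transactions, minSupport);
  }
  return levels;
}
{code}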

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm

2009-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-130:
---

Attachment: MAHOUT-130.patch

Draft.  Not sure if the optimizations make sense or not, but I think they do.

The patch applies in the core directory, not at the top level.

 Vector should allow for other normalize powers than the L-2 norm
 

 Key: MAHOUT-130
 URL: https://issues.apache.org/jira/browse/MAHOUT-130
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-130.patch


 Modify Vector to allow other normalize functions for the Vector
 See 
 http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm

2009-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-130:
---

Attachment: (was: MAHOUT-130.patch)

 Vector should allow for other normalize powers than the L-2 norm
 

 Key: MAHOUT-130
 URL: https://issues.apache.org/jira/browse/MAHOUT-130
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-130.patch


 Modify Vector to allow other normalize functions for the Vector
 See 
 http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm

2009-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-130:
---

Attachment: MAHOUT-130.patch

No reason the power needs to be int only, I suppose

 Vector should allow for other normalize powers than the L-2 norm
 

 Key: MAHOUT-130
 URL: https://issues.apache.org/jira/browse/MAHOUT-130
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-130.patch


 Modify Vector to allow other normalize functions for the Vector
 See 
 http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm

2009-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717511#action_12717511
 ] 

Grant Ingersoll commented on MAHOUT-130:


D'oh.  My bad.  I had initialized val = 1 instead of val = 0;

All looks good now.

 Vector should allow for other normalize powers than the L-2 norm
 

 Key: MAHOUT-130
 URL: https://issues.apache.org/jira/browse/MAHOUT-130
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-130.patch


 Modify Vector to allow other normalize functions for the Vector
 See 
 http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm

2009-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-130:
---

Attachment: MAHOUT-130.patch

Adds the 0 norm and the infinity norm, incorporates Ted's and David's 
suggestions, and adds maxValue and maxValueIndex methods.
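
For reference, a hedged sketch of how a generalized Lp norm with the special 
cases might look; this is not the committed patch, and the array-based 
signature is just for illustration:

{code}
// Hypothetical sketch: generalized Lp norm. power == 0 counts non-zero
// elements and power == infinity takes the max |x|. Note that the
// accumulator starts at 0, not 1 (the initialization bug noted earlier).
public static double norm(double[] elements, double power) {
  if (power < 0.0) {
    throw new IllegalArgumentException("power must be >= 0");
  }
  double val = 0.0;
  if (power == 0.0) {
    for (double x : elements) {
      if (x != 0.0) val++;
    }
    return val;
  }
  if (Double.isInfinite(power)) {
    for (double x : elements) {
      val = Math.max(val, Math.abs(x));
    }
    return val;
  }
  for (double x : elements) {
    val += Math.pow(Math.abs(x), power);
  }
  return Math.pow(val, 1.0 / power);
}
{code}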

 Vector should allow for other normalize powers than the L-2 norm
 

 Key: MAHOUT-130
 URL: https://issues.apache.org/jira/browse/MAHOUT-130
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-130-both-ways.patch, MAHOUT-130.patch, 
 MAHOUT-130.patch


 Modify Vector to allow other normalize functions for the Vector
 See 
 http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1208) The Default SearchComponents (QueryComponent, etc.) cannot currently support SolrCoreAware or ResourceLoaderAware

2009-06-08 Thread Grant Ingersoll (JIRA)
The Default SearchComponents (QueryComponent, etc.) cannot currently support 
SolrCoreAware or ResourceLoaderAware
-

 Key: SOLR-1208
 URL: https://issues.apache.org/jira/browse/SOLR-1208
 Project: Solr
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


The Default SearchComponents are not instantiated via the SolrResourceLoader 
and are thus not put in the waiting lists for SolrCoreAware and 
ResourceLoaderAware.  Thus, they are not constructed in the same way that 
other SearchComponents might be constructed.

See 
http://www.lucidimagination.com/search/document/ef69fdc7dfb17428/default_searchcomponents_and_solrcoreaware
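
For context, here is the callback in question; only instances created through 
the SolrResourceLoader end up on the waiting list that triggers it.  A minimal 
sketch with an illustrative class name:

{code}
import org.apache.solr.core.SolrCore;
import org.apache.solr.util.plugin.SolrCoreAware;

// Illustrative only: inform(SolrCore) is invoked once the core is ready,
// but only for instances the resource loader created and registered.
public class MyCoreAwarePlugin implements SolrCoreAware {
  private SolrCore core;

  public void inform(SolrCore core) {
    this.core = core; // safe point to look up other core resources
  }
}
{code}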



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1209) Site search powered by Lucene/Solr

2009-06-08 Thread Grant Ingersoll (JIRA)
Site search powered by Lucene/Solr
--

 Key: SOLR-1209
 URL: https://issues.apache.org/jira/browse/SOLR-1209
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor


For a number of years now, the Lucene community has been criticized for not 
eating our own dog food when it comes to search. My company has built and 
hosts a site search (http://www.lucidimagination.com/search) that is powered by 
Apache Solr and Lucene and we'd like to donate its use to the Lucene 
community. Additionally, it allows one to search all of the Solr content from a 
single place, including web, wiki, JIRA and mail archives. See also 
http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org

A preview of the site (for Mahout) is available at 
http://people.apache.org/~gsingers/mahout/site/publish/

Lucid has a fault tolerant setup with replication and fail over as well as 
monitoring services in place. We are committed to maintaining and expanding the 
search capabilities on the site.

The following patch adds a skin to the Forrest site that enables the Solr site 
to search Solr only content using Lucene/Solr. When a search is submitted, it 
automatically selects the Solr facet such that only Solr content is searched. 
From there, users can then narrow/broaden their search criteria.

I'm submitting this patch to Solr first, as we'd like to roll out our 
capabilities to some of the smaller communities first and then broaden to the 
rest of the Lucene ecosystem.

I plan on committing in 3 or 4 days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1209) Site search powered by Lucene/Solr

2009-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1209:
--

Attachment: SOLR-1209.patch

Patch changing the skin to use Solr powered search

 Site search powered by Lucene/Solr
 --

 Key: SOLR-1209
 URL: https://issues.apache.org/jira/browse/SOLR-1209
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1209.patch


 For a number of years now, the Lucene community has been criticized for not 
 eating our own dog food when it comes to search. My company has built and 
 hosts a site search (http://www.lucidimagination.com/search) that is powered 
 by Apache Solr and Lucene and we'd like to donate its use to the Lucene 
 community. Additionally, it allows one to search all of the Solr content from 
 a single place, including web, wiki, JIRA and mail archives. See also 
 http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
 A preview of the site (for Mahout) is available at 
 http://people.apache.org/~gsingers/mahout/site/publish/
 Lucid has a fault tolerant setup with replication and fail over as well as 
 monitoring services in place. We are committed to maintaining and expanding 
 the search capabilities on the site.
 The following patch adds a skin to the Forrest site that enables the Solr 
 site to search Solr only content using Lucene/Solr. When a search is 
 submitted, it automatically selects the Solr facet such that only Solr 
 content is searched. From there, users can then narrow/broaden their search 
 criteria.
 I'm submitting this patch to Solr first, as we'd like to roll out our 
 capabilities to some of the smaller communities first and then broaden to the 
 rest of the Lucene ecosystem.
 I plan on committing in 3 or 4 days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1567) New flexible query parser

2009-06-04 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12716311#action_12716311
 ] 

Grant Ingersoll commented on LUCENE-1567:
-

The software grant has been received and filed.  I will update the paperwork 
and work to finish this out next week, so that we can then work to commit it.

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch


 From the New flexible query parser thread by Michael Busch
 in my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is Apache committer and PMC member
 on the Tuscany project and getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was to separate the syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
   AND
  /   \
 A B
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. It is in fact a
 configurable chain of processors. Each processors can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
 Furthermore, the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow attaching resource bundles. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in a few days. We now have a 100%
 compatible Lucene query parser, which means the syntax is identical and
 all query parser test cases pass on the new one too using a wrapper.
 Recent posts show that there is demand for query syntax improvements,
 e.g. improved range query syntax or operator precedence. There are
 already different QP implementations in Lucene+contrib, however I think
 we did not keep them all up to date and in sync. This is not too
 surprising, because usually when fixes and changes are made to the main
 query parser, people don't make the corresponding changes in the contrib
 parsers. (I'm guilty here too)
 With this new architecture it will be much easier to maintain different
 query syntaxes, as the amount of code in the first layer is quite small.
 All syntaxes would benefit from patches and improvements we make to the
 underlying layers, which will make supporting different syntaxes much
 more manageable.
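
Schematically, the three layers compose roughly like this; the variable and 
class names below are illustrative, not the contributed API:

{code}
// Hypothetical sketch of the three-layer flow described above.
QueryNode root = syntaxParser.parse("a AND b");  // 1. text -> QueryNodeTree
root = processorChain.process(root);             // 2. rewrite/optimize nodes
Query query = queryBuilder.build(root);          // 3. tree -> Lucene Query object
{code}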

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-128) maven parent not included in build

2009-06-04 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-128.


Resolution: Fixed

committed

 maven parent not included in build
 --

 Key: MAHOUT-128
 URL: https://issues.apache.org/jira/browse/MAHOUT-128
 Project: Mahout
  Issue Type: Bug
Reporter: Benson Margulies
 Attachments: pom.diff


 The Maven parent isn't included as a module, so its pom isn't installed, and 
 building another project that depends on mahout-core fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-02 Thread Grant Ingersoll (JIRA)
New Token filter for adding payloads in-stream


 Key: LUCENE-1676
 URL: https://issues.apache.org/jira/browse/LUCENE-1676
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9


This TokenFilter is able to split a token based on a delimiter and use one part 
as the token and the other part as a payload.  This allows someone to include 
payloads inline with tokens (presumably set up by a pipeline ahead of time).  An 
example is apropos.  Given a | delimiter, we could have a stream that looks 
like:
{quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ 
dogs|NN{quote}

In this case, this would produce tokens and payloads (assuming whitespace 
tokenization):
Token: the
Payload: null

Token: quick
Payload: JJ

Token: red
Payload: JJ

and so on.

This patch will also support pluggable encoders for the payloads, so it can 
convert from the character array to byte arrays as appropriate.
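
To make the idea concrete, here is a rough sketch against the pre-attribute 
(2.4-style) TokenStream API; the class name is made up, and the details differ 
from the attached patch:

{code}
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

// Hypothetical sketch: split each token at the delimiter; the leading part
// stays the term and the trailing part becomes the payload.
public class DelimitedPayloadFilterSketch extends TokenFilter {
  private final char delimiter;

  public DelimitedPayloadFilterSketch(TokenStream input, char delimiter) {
    super(input);
    this.delimiter = delimiter;
  }

  public Token next(final Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) {
      return null;
    }
    String term = token.term();
    int idx = term.indexOf(delimiter);
    if (idx >= 0) {
      // A pluggable encoder would plug in here; raw UTF-8 bytes keep the
      // sketch simple.
      token.setPayload(new Payload(term.substring(idx + 1).getBytes("UTF-8")));
      token.setTermBuffer(term.substring(0, idx));
    }
    return token;
  }
}
{code}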

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1676) New Token filter for adding payloads in-stream

2009-06-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1676:


Attachment: LUCENE-1676.patch

Here's a first draft of this.  See the test case for an example.

 New Token filter for adding payloads in-stream
 

 Key: LUCENE-1676
 URL: https://issues.apache.org/jira/browse/LUCENE-1676
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1676.patch


 This TokenFilter is able to split a token based on a delimiter and use one 
 part as the token and the other part as a payload.  This allows someone to 
 include payloads inline with tokens (presumably set up by a pipeline ahead of 
 time).  An example is apropos.  Given a | delimiter, we could have a stream 
 that looks like:
 {quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ 
 dogs|NN{quote}
 In this case, this would produce tokens and payloads (assuming whitespace 
 tokenization):
 Token: the
 Payload: null
 Token: quick
 Payload: JJ
 Token: red
 Payload: JJ
 and so on.
 This patch will also support pluggable encoders for the payloads, so it can 
 convert from the character array to byte arrays as appropriate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (MAHOUT-126) Prepare document vectors from the text

2009-05-29 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-126:
--

Assignee: Grant Ingersoll

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken into two tasks. 
 1. Create a Lucene index of the input plain-text documents. 
 2. From the index, generate the (sparse) document vectors with weights as 
 TF-IDF values of the terms. With a Lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-05-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714414#action_12714414
 ] 

Grant Ingersoll commented on MAHOUT-126:


See SOLR-1193.  

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken into two tasks. 
 1. Create a Lucene index of the input plain-text documents. 
 2. From the index, generate the (sparse) document vectors with weights as 
 TF-IDF values of the terms. With a Lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-05-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714509#action_12714509
 ] 

Grant Ingersoll commented on MAHOUT-126:


So, just kind of brainstorming here, but I think we should create a separate 
module for this kind of stuff, to keep it out of core and give us some more 
flexibility in regard to dependencies, etc.

Also (and I realize this is just a starting patch), I think we should assume a 
Lucene index already exists instead of maintaining code to actually create an 
index.  There are a lot of ways to do that, and people will likely have 
different fields, etc.  For instance, Solr can provide all of the capabilities 
here, and it has distributed support, so it can scale.  Moreover, people may 
have the info in a DB or in other places.  I realize we need baby steps, 
but...

I'll try to post a patch this afternoon that takes this effort and melds it 
with some of my ideas for demo purposes.

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: mahout-126-benson.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken into two tasks. 
 1. Create a Lucene index of the input plain-text documents. 
 2. From the index, generate the (sparse) document vectors with weights as 
 TF-IDF values of the terms. With a Lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-05-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714515#action_12714515
 ] 

Grant Ingersoll commented on MAHOUT-126:


Shashikant,

A couple of comments on the Lucene-specific stuff, though, so that you can 
speed up what you have.

First off, have a look at Lucene's support for TermVectorMapper.  Much like SAX, 
it gives you a callback mechanism so that you don't have to construct two 
different data structures (i.e. many people incorrectly use the DOM to parse 
XML and then extract out of the DOM into their own data structure when they 
should use SAX instead).

You might also have a look at the TermVectorComponent in Solr, as it pretty 
much does what you are looking to do in this patch, and I believe it to be 
more efficient.

It seems like we should be able to avoid caching the whole term list in 
memory.  At a minimum, if you are going to, allTerms should be a Map<String, 
Integer> that stores the term and its DF (doc freq.), as you are currently 
doing the DF lookup twice, AFAICT.  DF lookup is expensive in Lucene.  If we 
don't cache the whole list, we should at least have an LRU cache for DF.
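
To sketch both suggestions together (the callback style plus a DF cache), 
something along these lines; the class is hypothetical and the TF-IDF formula 
shown is just one common variant:

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Hypothetical sketch: weights are computed as terms stream by, and each
// DF is looked up at most once thanks to the cache.
public class TfIdfMapperSketch extends TermVectorMapper {
  private final IndexReader reader;
  private final String field;
  private final Map<String, Integer> dfCache = new HashMap<String, Integer>();
  private final Map<String, Double> weights = new HashMap<String, Double>();

  public TfIdfMapperSketch(IndexReader reader, String field) {
    this.reader = reader;
    this.field = field;
  }

  public boolean isIgnoringPositions() { return true; } // only term + freq needed
  public boolean isIgnoringOffsets() { return true; }

  public void setExpectations(String field, int numTerms,
                              boolean storeOffsets, boolean storePositions) {
    weights.clear();
  }

  public void map(String term, int frequency,
                  TermVectorOffsetInfo[] offsets, int[] positions) {
    Integer df = dfCache.get(term);
    if (df == null) {
      try {
        df = reader.docFreq(new Term(field, term)); // expensive; do it once
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
      dfCache.put(term, df);
    }
    double idf = Math.log(reader.numDocs() / (double) (df + 1));
    weights.put(term, frequency * idf);
  }

  public Map<String, Double> getWeights() { return weights; }
}
{code}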

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: mahout-126-benson.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken into two tasks. 
 1. Create a Lucene index of the input plain-text documents. 
 2. From the index, generate the (sparse) document vectors with weights as 
 TF-IDF values of the terms. With a Lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-120) Site search powered by Solr

2009-05-28 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-120.


Resolution: Fixed

Committed revision 779594.

 Site search powered by Solr
 ---

 Key: MAHOUT-120
 URL: https://issues.apache.org/jira/browse/MAHOUT-120
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-120.patch, MAHOUT-120.patch


 For a number of years now, the Lucene community has been criticized for not 
 eating our own dog food when it comes to search.  My company has built and 
 hosts a site search (http://www.lucidimagination.com/search) that is powered 
 by Apache Solr and Lucene and we'd like to donate its use to the Lucene 
 community.   Additionally, it allows one to search all of the Mahout content 
 from a single place, including web, wiki, JIRA and mail archives.   See also 
 http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
 A preview of the site is available at 
 http://people.apache.org/~gsingers/mahout/site/publish/
 Lucid has a fault tolerant setup with replication and fail over as well as 
 monitoring services in place.  We are committed to maintaining and expanding 
 the search capabilities on the site.
 The following patch adds a skin to the Forrest site that enables the Mahout 
 site to search Mahout only content using Lucene/Solr.  When a search is 
 submitted, it automatically selects the Mahout facet such that only Mahout 
 content is searched.  From there, users can then narrow/broaden their search 
 criteria.
 I'm submitting this patch to Mahout first, as we'd like to roll out our 
 capabilities to some of the smaller communities first and then broaden to the 
 rest of the Lucene ecosystem.
 I plan on committing in 3 or 4 days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1193) Provide option for TermVectorComponent to provide a way of retrieving TV info around a position instead of the whole vector

2009-05-28 Thread Grant Ingersoll (JIRA)
Provide option for TermVectorComponent to provide a way of retrieving TV info 
around a position instead of the whole vector
---

 Key: SOLR-1193
 URL: https://issues.apache.org/jira/browse/SOLR-1193
 Project: Solr
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor


It's often useful to retrieve TermVector information around (within some 
user-specified window) a specific position or offset.  The TermVectorComponent 
can easily be modified to use a TermVectorMapper that is aware of 
position/offset information and only returns term info within the window.
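
A minimal sketch of the windowed mapper idea; the class and field names are 
illustrative, not an actual patch:

{code}
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Hypothetical sketch: keep only terms that have at least one position
// inside [center - window, center + window].
public class WindowedMapperSketch extends TermVectorMapper {
  private final int center;
  private final int window;
  private final Set<String> kept = new HashSet<String>();

  public WindowedMapperSketch(int center, int window) {
    this.center = center;
    this.window = window;
  }

  public void setExpectations(String field, int numTerms,
                              boolean storeOffsets, boolean storePositions) {
    kept.clear();
  }

  public void map(String term, int frequency,
                  TermVectorOffsetInfo[] offsets, int[] positions) {
    if (positions == null) {
      return; // positions were not stored for this field
    }
    for (int p : positions) {
      if (Math.abs(p - center) <= window) {
        kept.add(term);
        break;
      }
    }
  }

  public Set<String> getTerms() { return kept; }
}
{code}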

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (TIKA-235) Site search powered by Lucene/Solr

2009-05-28 Thread Grant Ingersoll (JIRA)
Site search powered by Lucene/Solr
--

 Key: TIKA-235
 URL: https://issues.apache.org/jira/browse/TIKA-235
 Project: Tika
  Issue Type: New Feature
Reporter: Grant Ingersoll
Priority: Minor


For a number of years now, the Lucene community has been criticized for not 
eating our own dog food when it comes to search. My company has built and 
hosts a site search (http://search.lucidimagination.com/) that is powered by 
Apache Solr and Lucene and we'd like to donate its use to the Lucene 
community. Additionally, it allows one to search all of the Tika content from a 
single place, including web, wiki, JIRA and mail archives. See also 
http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org

A sample of what it _might_ look like is at 
http://people.apache.org/~gsingers/tika/.  Note, however, I am not entirely 
sure how Tika deploys just yet, so there are a few issues w/ the display.

Lucid has a fault tolerant setup with replication and fail over as well as 
monitoring services in place. We are committed to maintaining and expanding the 
search capabilities on the site.

The following patch adds the basics to Tika to support the search, but isn't 
entirely done yet b/c I'm not sure what look and feel Tika wants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (TIKA-235) Site search powered by Lucene/Solr

2009-05-28 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated TIKA-235:
-

Attachment: TIKA-235.patch

First draft of a patch.  See also MAHOUT-120

 Site search powered by Lucene/Solr
 --

 Key: TIKA-235
 URL: https://issues.apache.org/jira/browse/TIKA-235
 Project: Tika
  Issue Type: New Feature
Reporter: Grant Ingersoll
Priority: Minor
 Attachments: TIKA-235.patch


 For a number of years now, the Lucene community has been criticized for not 
 eating our own dog food when it comes to search. My company has built and 
 hosts a site search (http://search.lucidimagination.com/) that is powered by 
 Apache Solr and Lucene and we'd like to donate its use to the Lucene 
 community. Additionally, it allows one to search all of the Tika content from 
 a single place, including web, wiki, JIRA and mail archives. See also 
 http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
 A sample of what it _might_ look like is at 
 http://people.apache.org/~gsingers/tika/.  Note, however, I am not entirely 
 sure how Tika deploys just yet, so there are a few issues w/ the display.
 Lucid has a fault tolerant setup with replication and fail over as well as 
 monitoring services in place. We are committed to maintaining and expanding 
 the search capabilities on the site.
 The following patch adds the basics to Tika to support the search, but isn't 
 entirely done yet b/c I'm not sure what look and feel Tika wants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-63) Self Organizing Map

2009-05-27 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-63.
---

Resolution: Duplicate

See MAHOUT-64

 Self Organizing Map
 ---

 Key: MAHOUT-63
 URL: https://issues.apache.org/jira/browse/MAHOUT-63
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Reporter: Farid Bourennani
Priority: Minor
   Original Estimate: 120h
  Remaining Estimate: 120h

 Implementation of Kohonen's Self-Organizing Map algorithm.
 Execution: run the SOMViewer; it takes 300 iterations.
 - The algorithm is too slow because of:
   GUI: the current one is a temporary one, but should be replaced by the 
 Prefuse library, as suggested by Ted.
   Self-Organizing Maps: the batch algorithm is faster than the sequential one 
 that I am currently using.
 - Documentation needs to be completed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors

2009-05-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-66?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713660#action_12713660
 ] 

Grant Ingersoll commented on MAHOUT-66:
---

Can this be closed?

 EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not 
 optimized for Sparse Vectors
 --

 Key: MAHOUT-66
 URL: https://issues.apache.org/jira/browse/MAHOUT-66
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Priority: Minor
 Attachments: MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch, 
 MAHOUT-66.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-125) Remove Deprecated Ant builds

2009-05-27 Thread Grant Ingersoll (JIRA)
Remove Deprecated Ant builds


 Key: MAHOUT-125
 URL: https://issues.apache.org/jira/browse/MAHOUT-125
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor


Finish transferring functionality from build-deprecated.xml files to Maven.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1188) TermVectorComponent Efficiency improvements

2009-05-27 Thread Grant Ingersoll (JIRA)
TermVectorComponent Efficiency improvements
---

 Key: SOLR-1188
 URL: https://issues.apache.org/jira/browse/SOLR-1188
 Project: Solr
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


The TermVectorComponent currently uses a TermVectorMapper that does not 
indicate to Lucene whether positions and offsets are of interest by overriding 
isIgnoringOffsets and isIgnoringPositions.  



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1188) TermVectorComponent Efficiency improvements

2009-05-27 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved SOLR-1188.
---

Resolution: Fixed

Committed simple patch to override the two methods.
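
The committed change isn't reproduced here, but the shape of such overrides on 
the component's mapper is roughly as follows; the guard flags are illustrative:

{code}
// Illustrative sketch: tell Lucene up front what the mapper will ignore so
// that positions and offsets aren't read from the index unnecessarily.
public boolean isIgnoringPositions() {
  return !positionsRequested; // illustrative flag from the request params
}

public boolean isIgnoringOffsets() {
  return !offsetsRequested;   // illustrative flag from the request params
}
{code}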

 TermVectorComponent Efficiency improvements
 ---

 Key: SOLR-1188
 URL: https://issues.apache.org/jira/browse/SOLR-1188
 Project: Solr
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


 The TermVectorComponent currently uses a TermVectorMapper that does not 
 indicate to Lucene whether positions and offsets are of interest by 
 overriding isIgnoringOffsets and isIgnoringPositions.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1177) Distributed TermsComponent

2009-05-22 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned SOLR-1177:
-

Assignee: Grant Ingersoll

 Distributed TermsComponent
 --

 Key: SOLR-1177
 URL: https://issues.apache.org/jira/browse/SOLR-1177
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Matt Weber
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1177.patch, TermsComponent.java, 
 TermsComponent.patch


 TermsComponent should be distributed

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

2009-05-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711150#action_12711150
 ] 

Grant Ingersoll commented on LUCENE-1550:
-

Committed revision 776704.

Thanks Tom!

 Add N-Gram String Matching for Spell Checking
 -

 Key: LUCENE-1550
 URL: https://issues.apache.org/jira/browse/LUCENE-1550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spellchecker
Affects Versions: 2.9
Reporter: Thomas Morton
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch


 N-Gram version of edit distance based on paper by Grzegorz Kondrak, N-gram 
 similarity and distance. Proceedings of the Twelfth International Conference 
 on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  
 Buenos Aires, Argentina, November 2005. 
 http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf
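
The committed code follows Kondrak's n-gram distance, which is not reproduced 
here; for intuition only, here is a much simpler bigram-overlap similarity 
(Dice coefficient) over character bigrams:

{code}
import java.util.HashSet;
import java.util.Set;

// For intuition only: Dice coefficient over character bigrams. This is a
// much simpler relative of Kondrak's n-gram distance, not the committed
// algorithm.
public final class BigramDice {
  public static float similarity(String a, String b) {
    Set<String> ga = bigrams(a);
    Set<String> gb = bigrams(b);
    if (ga.isEmpty() && gb.isEmpty()) {
      return 1f; // both strings shorter than one bigram
    }
    int overlap = 0;
    for (String gram : ga) {
      if (gb.contains(gram)) overlap++;
    }
    return 2f * overlap / (ga.size() + gb.size());
  }

  private static Set<String> bigrams(String s) {
    Set<String> grams = new HashSet<String>();
    for (int i = 0; i + 2 <= s.length(); i++) {
      grams.add(s.substring(i, i + 2));
    }
    return grams;
  }
}
{code}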

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711137#action_12711137
 ] 

Grant Ingersoll commented on SOLR-769:
--

Committed revision 776692.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.
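
Schematically, the chained-component approach might look like this inside 
process(); the clusterEngine field is hypothetical:

{code}
import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.search.DocList;

// Hypothetical sketch: the component reads the DocList that QueryComponent
// already produced and appends the clusters to the response.
public void process(ResponseBuilder rb) throws IOException {
  DocList results = rb.getResults().docList; // search results to cluster
  Object clusters = clusterEngine.cluster(results, rb.req); // illustrative engine
  rb.rsp.add("clusters", clusters);
}
{code}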

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-769) Support Document and Search Result clustering

2009-05-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711137#action_12711137
 ] 

Grant Ingersoll edited comment on SOLR-769 at 5/20/09 6:42 AM:
---

Committed revision 776692.

Thanks to everyone who helped out, especially Carrot2 creators Dawid and 
Stanislaw.

  was (Author: gsingers):
Committed revision 776692.
  
 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-119) Create an uber jar for use on Amazon Elastic M/R, etc.

2009-05-19 Thread Grant Ingersoll (JIRA)
Create an uber jar for use on Amazon Elastic M/R, etc.
--

 Key: MAHOUT-119
 URL: https://issues.apache.org/jira/browse/MAHOUT-119
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor


Some cloud resources have problems loading classes across JARs nested in the 
job JAR.  See 
http://www.lucidimagination.com/search/document/3a5680dfe567d812/running_dirichlet_example_on_aemr

This can be fixed by adding a new build target that creates a single JAR.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-120) Site search powered by Solr

2009-05-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-120:
---

Attachment: MAHOUT-120.patch

Patch to change the site skin.  This was created by slightly modifying the 
default Forrest skin.

 Site search powered by Solr
 ---

 Key: MAHOUT-120
 URL: https://issues.apache.org/jira/browse/MAHOUT-120
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-120.patch


 For a number of years now, the Lucene community has been criticized for not 
 eating our own dog food when it comes to search.  My company has built and 
 hosts a site search (http://www.lucidimagination.com/search) that is powered 
 by Apache Solr and Lucene and we'd like to donate its use to the Lucene 
 community.   Additionally, it allows one to search all of the Mahout content 
 from a single place, including web, wiki, JIRA and mail archives.   See also 
 http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
 A preview of the site is available at 
 http://people.apache.org/~gsingers/mahout/site/publish/
 Lucid has a fault tolerant setup with replication and fail over as well as 
 monitoring services in place.  We are committed to maintaining and expanding 
 the search capabilities on the site.
 The following patch adds a skin to the Forrest site that enables the Mahout 
 site to search Mahout only content using Lucene/Solr.  When a search is 
 submitted, it automatically selects the Mahout facet such that only Mahout 
 content is searched.  From there, users can then narrow/broaden their search 
 criteria.
 I'm submitting this patch to Mahout first, as we'd like to roll out our 
 capabilities to some of the smaller communities first and then broaden to the 
 rest of the Lucene ecosystem.
 I plan on committing in 3 or 4 days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-120) Site search powered by Solr

2009-05-19 Thread Grant Ingersoll (JIRA)
Site search powered by Solr
---

 Key: MAHOUT-120
 URL: https://issues.apache.org/jira/browse/MAHOUT-120
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: MAHOUT-120.patch

For a number of years now, the Lucene community has been criticized for not 
eating our own dog food when it comes to search.  My company has built and 
hosts a site search (http://www.lucidimagination.com/search) that is powered by 
Apache Solr and Lucene, and we'd like to donate its use to the Lucene 
community.   Additionally, it allows one to search all of the Mahout content 
from a single place, including web, wiki, JIRA and mail archives.   See also 
http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org

A preview of the site is available at 
http://people.apache.org/~gsingers/mahout/site/publish/

Lucid has a fault-tolerant setup with replication and failover, as well as 
monitoring services in place.  We are committed to maintaining and expanding 
the search capabilities on the site.

The following patch adds a skin to the Forrest site that enables the Mahout 
site to search Mahout-only content using Lucene/Solr.  When a search is 
submitted, it automatically selects the Mahout facet such that only Mahout 
content is searched.  From there, users can then narrow/broaden their search 
criteria.

I'm submitting this patch to Mahout first, as we'd like to roll out our 
capabilities to some of the smaller communities first and then broaden to the 
rest of the Lucene ecosystem.

I plan on committing in 3 or 4 days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-761) Fix Flare license headers

2009-05-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710870#action_12710870
 ] 

Grant Ingersoll commented on SOLR-761:
--

FYI: ant rat-sources is helpful for easily identifying which files are missing 
license headers.

 Fix Flare license headers
 -

 Key: SOLR-761
 URL: https://issues.apache.org/jira/browse/SOLR-761
 Project: Solr
  Issue Type: Task
Reporter: Erik Hatcher
Assignee: Erik Hatcher
 Fix For: 1.4


 Solr Flare has inconsistent use of the Apache Software License header in its 
 files. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-05-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-769:
-

Attachment: SOLR-769.patch

OK, I think all the ducks are in a row.  

I intend to commit on Friday.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.
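
 As a rough illustration of the chaining described above (a hypothetical 
 sketch, not the attached patch; class and method names are made up), a 
 component registered after the QueryComponent can read the DocList off the 
 ResponseBuilder and attach clusters to the response:

 {code}
 // Hypothetical sketch only -- not the SOLR-769 patch.
 import java.io.IOException;

 import org.apache.solr.handler.component.ResponseBuilder;
 import org.apache.solr.handler.component.SearchComponent;
 import org.apache.solr.search.DocList;

 public class ClusteringSketchComponent extends SearchComponent {

   public void prepare(ResponseBuilder rb) throws IOException {
     // Nothing to prepare; the QueryComponent parses the query.
   }

   public void process(ResponseBuilder rb) throws IOException {
     // Registered after QueryComponent in the chain, so the search
     // results (the DocList) are already available on the builder.
     DocList docs = rb.getResults().docList;
     // Hand the result docs to a clustering engine (e.g. Carrot2)
     // and add whatever it returns to the response.
     rb.rsp.add("clusters", clusterLabelsFor(docs));
   }

   // Stand-in for the real call into the clustering library.
   private Object clusterLabelsFor(DocList docs) {
     return "clusters over " + docs.matches() + " matches";
   }

   public String getDescription() { return "clustering sketch"; }
   public String getSource() { return "$URL$"; }
   public String getSourceId() { return "$Id$"; }
   public String getVersion() { return "1.0"; }
 }
 {code}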

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708534#action_12708534
 ] 

Grant Ingersoll commented on SOLR-773:
--

I think, and correct me if I'm wrong, that one of the things that often happens 
with geo stuff is that there are a lot of unique values.  This often has memory 
ramifications when used with FunctionQueries, since most ValueSources uninvert 
the field.
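
A minimal sketch of that memory cost, assuming the FieldCache-backed 
ValueSources of this era (the field names are illustrative): the cache 
materializes one value per document, whether or not the values are unique.

{code}
// Sketch of the uninversion cost; not code from any patch here.
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class UninvertCostSketch {
  // FieldCache fills a float[maxDoc] per field, so two geo fields on a
  // 10M-doc index hold roughly 2 * 4 bytes * 10M = ~80 MB of heap, and
  // mostly-unique values leave nothing for the cache to share.
  public static long approxBytesHeld(IndexReader reader) throws IOException {
    float[] lats = FieldCache.DEFAULT.getFloats(reader, "lat");
    float[] lngs = FieldCache.DEFAULT.getFloats(reader, "lng");
    return 4L * (lats.length + lngs.length);
  }
}
{code}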

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
 SOLR-773.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708534#action_12708534
 ] 

Grant Ingersoll edited comment on SOLR-773 at 5/12/09 11:02 AM:


I think, and correct me if I'm wrong, that one of the things that often happens 
with geo stuff is that there are a lot of unique values.  This often has memory 
ramifications when used with FunctionQueries, since most ValueSources uninvert 
the field.

Otherwise, I like the sounds of Yonik's proposal as well.

  was (Author: gsingers):
I think, and correct me if I'm wrong, that one of the things that often 
happens with geo stuff is that there are a lot of unique values.  This often 
has memory ramifications when used with FunctionQueries, since most 
ValueSources uninvert the field.
  
 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
 SOLR-773.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708541#action_12708541
 ] 

Grant Ingersoll commented on SOLR-773:
--

Also, how does the TrieRange stuff factor into this?

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
 SOLR-773.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708680#action_12708680
 ] 

Grant Ingersoll commented on SOLR-773:
--

{quote}
1) What is the goal we want to achieve?

* Provide a first iteration of a geographical search entity to SOLR
* Bring an external, popular plugin in out of the cold into ASF and SOLR; 
it helps Solr users out and increases developers from 1 to many.
{quote}

Agreed on the first, not 100% certain on the second.  On the second, this issue 
is the gatekeeper.  If people reviewing the patch feel there are better ways 
to do things, then we should work through them before committing.  What you are 
effectively seeing is an increase in the developers working on it from 1 to many; 
it's just not on committed code.

{quote}
2) What is the level of commitment, and road map of spatial solutions in lucene 
and solr?

* The primary goal of SOLR is as a text search engine, not GIS search; 
there are other and better ways to do that
  without reinventing the wheel and shoehorning it into Lucene
  (e.g. persistent doc id mappings that can be referenced outside of 
Lucene, so things like PostGIS and other tools can be used).
* We can never fully solve everyone's needs at once; let's start with what 
we have, and iterate upon it.
* I'm happy for any improvements as long as they keep to two goals: A. don't 
make it stupid; B. don't make it complex.
{quote}

On the first point, I don't follow.  Aren't LocalLucene and LocalSolr 
exactly a GIS search capability for Lucene/Solr?  I'm not sure I would 
categorize it as shoehorning.  There are many things that Lucene/Solr can 
power, and GIS search is one of them.  By committing this patch (or some 
variation), we are saying Solr is going to support GIS search.  Of course, 
there are other ways to do it, but that doesn't preclude it from L/S.  The 
combination of text search plus GIS search is very powerful, as you know.  

Still, I think Yonik's main point is: why reinvent the wheel when it comes to 
things like distributed search and the need for custom code for indexing, etc., 
when they can likely be handled through function queries and field types, so that 
all of Solr's current functionality would just work?  The other 
capabilities (like sorting by a FunctionQuery) are icing on the cake that helps 
solve other problems as well.

Totally agree on the other points.  Also very cool to see the benchmarking info.


 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
 SOLR-773.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708680#action_12708680
 ] 

Grant Ingersoll edited comment on SOLR-773 at 5/12/09 4:21 PM:
---

{quote}
1) What is the goal we want to achieve?

* Provide a first iteration of a geographical search entity to SOLR
* Bring an external, popular plugin in out of the cold into ASF and SOLR; 
it helps Solr users out and increases developers from 1 to many.
{quote}

Agreed on the first, not 100% certain on the second.  On the second, this issue 
is the gatekeeper.  If people reviewing the patch feel there are better ways 
to do things, then we should work through them before committing.  What you are 
effectively seeing is an increase in the developers working on it from 1 to many; 
it's just not on committed code.

{quote}
2) What is the level of commitment, and road map of spatial solutions in lucene 
and solr?

* The primary goal of SOLR is as a text search engine, not GIS search; 
there are other and better ways to do that
  without reinventing the wheel and shoehorning it into Lucene
  (e.g. persistent doc id mappings that can be referenced outside of 
Lucene, so things like PostGIS and other tools can be used).
* We can never fully solve everyone's needs at once; let's start with what 
we have, and iterate upon it.
* I'm happy for any improvements as long as they keep to two goals: A. don't 
make it stupid; B. don't make it complex.
{quote}

On the first point, I don't follow.  Aren't LocalLucene and LocalSolr 
exactly a GIS search capability for Lucene/Solr?  I'm not sure I would 
categorize it as shoehorning.  There are many things that Lucene/Solr can 
power, and GIS search with text is one of them.  By committing this patch (or some 
variation), we are saying Solr is going to support it.  Of course, there are 
other ways to do it, but that doesn't preclude it from L/S.  The combination of 
text search plus GIS search is very powerful, as you know.  

Still, I think Yonik's main point is: why reinvent the wheel when it comes to 
things like distributed search and the need for custom code for indexing, etc., 
when they can likely be handled through function queries and field types, so that 
all of Solr's current functionality would just work?  The other 
capabilities (like sorting by a FunctionQuery) are icing on the cake that helps 
solve other problems as well.

Totally agree on the other points.  Also very cool to see the benchmarking info.


  was (Author: gsingers):
{quote}
1) What is the goal we want to achieve?

* Provide a first iteration of a geographical search entity to SOLR
* Bring an external, popular plugin in out of the cold into ASF and SOLR; 
it helps Solr users out and increases developers from 1 to many.
{quote}

Agreed on the first, not 100% certain on the second.  On the second, this issue 
is the gatekeeper.  If people reviewing the patch feel there are better ways 
to do things, then we should work through them before committing.  What you are 
effectively seeing is an increase in the developers working on it from 1 to many; 
it's just not on committed code.

{quote}
2) What is the level of commitment, and road map of spatial solutions in lucene 
and solr?

* The primary goal of SOLR is as a text search engine, not GIS search; 
there are other and better ways to do that
  without reinventing the wheel and shoehorning it into Lucene
  (e.g. persistent doc id mappings that can be referenced outside of 
Lucene, so things like PostGIS and other tools can be used).
* We can never fully solve everyone's needs at once; let's start with what 
we have, and iterate upon it.
* I'm happy for any improvements as long as they keep to two goals: A. don't 
make it stupid; B. don't make it complex.
{quote}

On the first point, I don't follow.  Aren't LocalLucene and LocalSolr 
exactly a GIS search capability for Lucene/Solr?  I'm not sure I would 
categorize it as shoehorning.  There are many things that Lucene/Solr can 
power, and GIS search is one of them.  By committing this patch (or some 
variation), we are saying Solr is going to support GIS search.  Of course, 
there are other ways to do it, but that doesn't preclude it from L/S.  The 
combination of text search plus GIS search is very powerful, as you know.  

Still, I think Yonik's main point is: why reinvent the wheel when it comes to 
things like distributed search and the need for custom code for indexing, etc., 
when they can likely be handled through function queries and field types, so that 
all of Solr's current functionality would just work?  The other 
capabilities (like sorting by a FunctionQuery) are icing on the cake that helps 
solve other problems as well.

Totally agree on the other points.  Also very cool to see the benchmarking info.

  
 Incorporate Local 

[jira] Commented: (LUCENE-1387) Add LocalLucene

2009-05-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708178#action_12708178
 ] 

Grant Ingersoll commented on LUCENE-1387:
-

FWIW, you might find the discussion on SOLR-773 interesting.

 Add LocalLucene
 ---

 Key: LUCENE-1387
 URL: https://issues.apache.org/jira/browse/LUCENE-1387
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spatial
Reporter: Grant Ingersoll
Assignee: Ryan McKinley
Priority: Minor
 Fix For: 2.9

 Attachments: spatial-lucene.zip, spatial.tar.gz, spatial.zip


 Local Lucene (Geo-search) has been donated to the Lucene project, per 
 https://issues.apache.org/jira/browse/INCUBATOR-77.  This issue is to handle 
 the Lucene portion of integration.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1138) Query Elevation Component should gracefully handle empty queries

2009-05-04 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705559#action_12705559
 ] 

Grant Ingersoll commented on SOLR-1138:
---

Committed revision 771268.

 Query Elevation Component should gracefully handle empty queries
 

 Key: SOLR-1138
 URL: https://issues.apache.org/jira/browse/SOLR-1138
 Project: Solr
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1138.patch


 From http://www.lucidimagination.com/search/document/3b50cd3506952f7 :
 {quote}
 In the QueryElevComponent (QEC) it currently throws an exception if  
 the input Query is null (line 329).  Additionally, I've seen cases  
 where it's possible that the Query is not null (q is not set, but  
 q.alt is *:*), but the rb.getQueryString() is null, which causes an  
 NPE on line 300 or so.
 I'd like to suggest that if the Query is empty/null, the QEC should  
 just go on its merry way as if there is nothing to do.  I don't think  
 a lack of query means that the QEC is improperly configured, as the  
 exception message implies:
   The QueryElevationComponent needs to be registered 'after' the query  
 component
 We should be making sure the QEC is properly registered at  
 initialization time.
 Thoughts?
 -Grant{quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1138) Query Elevation Component should gracefully handle empty queries

2009-05-03 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1138:
--

Attachment: SOLR-1138.patch

Here's a patch that fixes this.  I plan on committing today.
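
For readers of the thread, the graceful behavior asked for amounts to an 
early return instead of an exception when there is nothing to elevate. A 
hedged sketch (not the attached patch itself):

{code}
// Hedged sketch only -- not the actual SOLR-1138 patch.
import org.apache.solr.handler.component.ResponseBuilder;

public class ElevationGuardSketch {
  // Checked at the top of the component: a null Query or null query
  // string means "nothing to elevate", not a misconfigured component,
  // so elevation should simply be skipped.
  public static boolean nothingToElevate(ResponseBuilder rb) {
    return rb.getQuery() == null || rb.getQueryString() == null;
  }
}
{code}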

 Query Elevation Component should gracefully handle empty queries
 

 Key: SOLR-1138
 URL: https://issues.apache.org/jira/browse/SOLR-1138
 Project: Solr
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1138.patch


 From http://www.lucidimagination.com/search/document/3b50cd3506952f7 :
 {quote}
 In the QueryElevComponent (QEC) it currently throws an exception if  
 the input Query is null (line 329).  Additionally, I've seen cases  
 where it's possible that the Query is not null (q is not set, but  
 q.alt is *:*), but the rb.getQueryString() is null, which causes an  
 NPE on line 300 or so.
 I'd like to suggest that if the Query is empty/null, the QEC should  
 just go on its merry way as if there is nothing to do.  I don't think  
 a lack of query means that the QEC is improperly configured, as the  
 exception message implies:
   The QueryElevationComponent needs to be registered 'after' the query  
 component
 We should be making sure the QEC is properly registered at  
 initialization time.
 Thoughts?
 -Grant{quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1138) Query Elevation Component should gracefully handle empty queries

2009-04-30 Thread Grant Ingersoll (JIRA)
Query Elevation Component should gracefully handle empty queries


 Key: SOLR-1138
 URL: https://issues.apache.org/jira/browse/SOLR-1138
 Project: Solr
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor


From http://www.lucidimagination.com/search/document/3b50cd3506952f7 :
{quote}
In the QueryElevComponent (QEC) it currently throws an exception if  
the input Query is null (line 329).  Additionally, I've seen cases  
where it's possible that the Query is not null (q is not set, but  
q.alt is *:*), but the rb.getQueryString() is null, which causes an  
NPE on line 300 or so.

I'd like to suggest that if the Query is empty/null, the QEC should  
just go on its merry way as if there is nothing to do.  I don't think  
a lack of query means that the QEC is improperly configured, as the  
exception message implies:
The QueryElevationComponent needs to be registered 'after' the query  
component

We should be making sure the QEC is properly registered at  
initialization time.

Thoughts?

-Grant{quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1128) Solr Cell Extract Only should also return Metadata too

2009-04-24 Thread Grant Ingersoll (JIRA)
Solr Cell Extract Only should also return Metadata too
--

 Key: SOLR-1128
 URL: https://issues.apache.org/jira/browse/SOLR-1128
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


Just as the title says.  When using extract.only, we should also include the 
Metadata in the response

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1128) Solr Cell Extract Only should also return Metadata too

2009-04-24 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved SOLR-1128.
---

Resolution: Fixed

Committed revision 768281.

 Solr Cell Extract Only should also return Metadata too
 --

 Key: SOLR-1128
 URL: https://issues.apache.org/jira/browse/SOLR-1128
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


 Just as the title says.  When using extract.only, we should also include the 
 Metadata in the response

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1099) FieldAnalysisRequestHandler

2009-04-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700782#action_12700782
 ] 

Grant Ingersoll commented on SOLR-1099:
---

So, why not just fold all of this into the ARH?  Seems like all of these 
features work just as well as input parameters and there is no need for 
deprecation, etc.  

 FieldAnalysisRequestHandler
 ---

 Key: SOLR-1099
 URL: https://issues.apache.org/jira/browse/SOLR-1099
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 1.3
Reporter: Uri Boness
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: AnalisysRequestHandler_refactored.patch, 
 analysis_request_handlers_incl_solrj.patch, 
 AnalysisRequestHandler_refactored1.patch, 
 FieldAnalysisRequestHandler_incl_test.patch


 The FieldAnalysisRequestHandler provides the analysis functionality of the 
 web admin page as a service. This handler accepts a fieldtype/fieldname 
 parameter and a value, and as a response returns a breakdown of the analysis 
 process. It is also possible to send a query value, which will use the 
 configured query analyzer, as well as a showmatch parameter, which will then 
 mark every matched token as a match.
 If this handler is added to the code base, I also recommend renaming the 
 current AnalysisRequestHandler to DocumentAnalysisRequestHandler and having 
 them both inherit from one AnalysisRequestHandlerBase class, which provides 
 the common functionality of the analysis breakdown and its translation to 
 named lists. This will also enhance the current AnalysisRequestHandler, which 
 right now is fairly simplistic.
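
 To make the service concrete, a request along the following lines would 
 return the token-by-token breakdown for one value against one field, with 
 matches against the query value marked (the parameter names and the 
 registered path are assumptions, not confirmed by this issue):

 {code}
 http://localhost:8983/solr/analysis/field?analysis.fieldname=text&analysis.fieldvalue=The+Quick+Red+Fox&analysis.query=fox&analysis.showmatch=true
 {code}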

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-1099) FieldAnalysisRequestHandler

2009-04-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700622#action_12700622
 ] 

Grant Ingersoll edited comment on SOLR-1099 at 4/19/09 6:02 PM:


Sorry for being a bit late...  
Am I understanding that the main thing this does is allow you to specify one 
input and get back analysis for each field you specify?  Well, that and the 
GET, right?

  was (Author: gsingers):
Sorry for being a bit late...  
Am I understand that the main thing this does is allow you to specify one input 
and get back analysis for each field you specify?
  
 FieldAnalysisRequestHandler
 ---

 Key: SOLR-1099
 URL: https://issues.apache.org/jira/browse/SOLR-1099
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 1.3
Reporter: Uri Boness
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: AnalisysRequestHandler_refactored.patch, 
 AnalysisRequestHandler_refactored1.patch, 
 FieldAnalysisRequestHandler_incl_test.patch


 The FieldAnalysisRequestHandler provides the analysis functionality of the 
 web admin page as a service. This handler accepts a fieldtype/fieldname 
 parameter and a value, and as a response returns a breakdown of the analysis 
 process. It is also possible to send a query value, which will use the 
 configured query analyzer, as well as a showmatch parameter, which will then 
 mark every matched token as a match.
 If this handler is added to the code base, I also recommend renaming the 
 current AnalysisRequestHandler to DocumentAnalysisRequestHandler and having 
 them both inherit from one AnalysisRequestHandlerBase class, which provides 
 the common functionality of the analysis breakdown and its translation to 
 named lists. This will also enhance the current AnalysisRequestHandler, which 
 right now is fairly simplistic.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1099) FieldAnalysisRequestHandler

2009-04-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700622#action_12700622
 ] 

Grant Ingersoll commented on SOLR-1099:
---

Sorry for being a bit late...  
Am I understand that the main thing this does is allow you to specify one input 
and get back analysis for each field you specify?

 FieldAnalysisRequestHandler
 ---

 Key: SOLR-1099
 URL: https://issues.apache.org/jira/browse/SOLR-1099
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 1.3
Reporter: Uri Boness
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: AnalisysRequestHandler_refactored.patch, 
 AnalysisRequestHandler_refactored1.patch, 
 FieldAnalysisRequestHandler_incl_test.patch


 The FieldAnalysisRequestHandler provides the analysis functionality of the 
 web admin page as a service. This handler accepts a fieldtype/fieldname 
 parameter and a value, and as a response returns a breakdown of the analysis 
 process. It is also possible to send a query value, which will use the 
 configured query analyzer, as well as a showmatch parameter, which will then 
 mark every matched token as a match.
 If this handler is added to the code base, I also recommend renaming the 
 current AnalysisRequestHandler to DocumentAnalysisRequestHandler and having 
 them both inherit from one AnalysisRequestHandlerBase class, which provides 
 the common functionality of the analysis breakdown and its translation to 
 named lists. This will also enhance the current AnalysisRequestHandler, which 
 right now is fairly simplistic.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700628#action_12700628
 ] 

Grant Ingersoll commented on SOLR-769:
--

Where can we download nni.jar from?  

Seems like if you only need two classes it would be easy enough to replace them 
with your own code.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-04-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-769:
-

Comment: was deleted

(was: Where can we download nni.jar from?  

Seems like if you only need two classes it would be easy enough to replace them 
with your own code.)

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-04-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-769:
-

Attachment: SOLR-769.tar
SOLR-769.patch

OK, I think this is ready to go, except I still need to double-check how it 
works with the release.   Since we can't distribute LGPL code, this is going 
to have to be a source-only release artifact and thus can never be in the 
WAR, unfortunately.

The tarball contains the JAR files that one needs, with the exception of the 
LGPL deps, which are downloaded from the appropriate places.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699908#action_12699908
 ] 

Grant Ingersoll commented on SOLR-769:
--

Looks like we need to make the NNI JAR a download, too, right?  It appears 
to be LGPL.  Where does that library come from, anyway?  I don't see it on 
Carrot trunk, but it is in the zip.  And a search for it doesn't reveal much.

-Grant

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (LUCENE-1588) Update Spatial Lucene sort to use FieldComparatorSource

2009-04-15 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-1588.
-

Resolution: Fixed

This was committed.

 Update Spatial Lucene sort to use FieldComparatorSource
 ---

 Key: LUCENE-1588
 URL: https://issues.apache.org/jira/browse/LUCENE-1588
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 2.9
Reporter: patrick o'leary
Assignee: patrick o'leary
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1588.patch


 Update distance sorting to use FieldComparator sorting as opposed to 
 SortComparator

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (SOLR-773) Incorporate Local Lucene/Solr

2009-04-15 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned SOLR-773:


Assignee: Grant Ingersoll

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-04-15 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699329#action_12699329
 ] 

Grant Ingersoll commented on SOLR-773:
--

We should be able to incorporate the GeoHash stuff in Lucene now, right?  I'm 
no spatial expert, but this means we could have an update processor that only 
uses one field, right?

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-04-15 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699364#action_12699364
 ] 

Grant Ingersoll commented on SOLR-773:
--

OK, so color me a total geo newbie, but...

So, if I index the spatial.xml in the patch I just submitted and I execute:
{code}
http://localhost:8983/solr/select?q=name:five
{code}

I get one result, which is expected.

If I then do a geo search:
{code}
http://localhost:8983/solr/select?q=name:five&qt=geo&long=-74.0093994140625&lat=40.75141843299745&radius=100&debugQuery=true
{code}

I get two results.   The second result is the other theater in the spatial.xml 
file.  Yet, it does not contain the value five in the name field even though 
it meets the spatial search criteria.

Shouldn't there just be one result?  What am I not understanding?

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
 spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-04-15 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699403#action_12699403
 ] 

Grant Ingersoll commented on SOLR-773:
--

OK, I think I understand why it does this, but it seems a little odd to me.  
The reason is that the geo handler uses the geo QParser, which 
ignores the query parameter and produces a query solely based on the lat/lon 
information.  

Like I said, I'm a newbie to geo search, but it seems like the QParser should 
delegate the parsing of the q param to some other parser and then it would only 
do distance calculations on the docset returned from the QueryComponent.  Of 
course, I guess one could ask what the semantics are of combining a text query 
with a spatial query, but I would suppose we could combine them with either AND 
or OR, right, such that if I OR'd them together, I would get all docs matching 
the query term OR'd with all docs in the bounding box.  Similarly, AND would 
yield all docs with the term in the bounding box, right?

Again, I am likely missing something, so bear with me.
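
To make the AND/OR semantics above concrete, here is a minimal sketch 
(illustrative only, not LocalSolr code) of combining the parsed text query 
with the spatial query as Lucene clauses:

{code}
// Sketch of the suggested AND/OR combination; illustrative only.
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class GeoTextCombineSketch {
  // AND: docs must match the term *and* fall inside the bounding box.
  public static Query and(Query text, Query spatial) {
    BooleanQuery q = new BooleanQuery();
    q.add(text, Occur.MUST);
    q.add(spatial, Occur.MUST);
    return q;
  }

  // OR: docs matching the term unioned with docs in the bounding box.
  public static Query or(Query text, Query spatial) {
    BooleanQuery q = new BooleanQuery();
    q.add(text, Occur.SHOULD);
    q.add(spatial, Occur.SHOULD);
    return q;
  }
}
{code}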

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
 spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-04-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698429#action_12698429
 ] 

Grant Ingersoll commented on SOLR-773:
--

This latest patch doesn't compile b/c it is missing the SpatialParams class.

 Incorporate Local Lucene/Solr
 -

 Key: SOLR-773
 URL: https://issues.apache.org/jira/browse/SOLR-773
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
 SOLR-773-local-lucene.patch, spatial-solr.tar.gz


 Local Lucene has been donated to the Lucene project.  It has some Solr 
 components, but we should evaluate how best to incorporate it into Solr.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-804) include lucene misc jar in solr distro

2009-04-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved SOLR-804.
--

Resolution: Fixed

Committed revision 764580.

Added lucene-misc-2.9-dev.jar from rev 764281 which should match the Lucene 
version on trunk.

 include lucene misc jar in solr distro
 --

 Key: SOLR-804
 URL: https://issues.apache.org/jira/browse/SOLR-804
 Project: Solr
  Issue Type: Wish
Affects Versions: 1.3
 Environment: all
Reporter: solrize
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


 It would be useful to have the lucene misc jar file included with solr.  My 
 immediate goal is to build several solr indexes in parallel on separate 
 servers, then run the index merge utility at the end to combine them into a 
 single index.  Erik H suggested I post an issue requesting that the misc 
 jar be included with Solr.  Thanks
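
 That merge step, sketched with plain Lucene calls of this era (a hedged 
 example with made-up paths, rather than the misc IndexMergeTool itself):

 {code}
 // Sketch of merging independently built indexes; paths are made up.
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;

 public class MergeShardsSketch {
   public static void main(String[] args) throws Exception {
     IndexWriter writer = new IndexWriter(
         FSDirectory.getDirectory("/data/merged"),
         new StandardAnalyzer(), true,
         IndexWriter.MaxFieldLength.UNLIMITED);
     // Each shard index was built in parallel on its own server.
     writer.addIndexesNoOptimize(new Directory[] {
         FSDirectory.getDirectory("/data/shard1"),
         FSDirectory.getDirectory("/data/shard2") });
     writer.optimize();  // collapse into a single, searchable index
     writer.close();
   }
 }
 {code}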

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-804) include lucene misc jar in solr distro

2009-04-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-804:
-

Fix Version/s: (was: 1.5)
   1.4

 include lucene misc jar in solr distro
 --

 Key: SOLR-804
 URL: https://issues.apache.org/jira/browse/SOLR-804
 Project: Solr
  Issue Type: Wish
Affects Versions: 1.3
 Environment: all
Reporter: solrize
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4


 It would be useful to have the lucene misc jar file included with solr.  My 
 immediate goal is to build several solr indexes in parallel on separate 
 servers, then run the index merge utility at the end to combine them into a 
 single index.  Erik H suggested I post an issue requesting that the misc 
 jar be included with Solr.  Thanks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (LUCENE-1567) New flexible query parser

2009-04-10 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-1567:
---

Assignee: Grant Ingersoll  (was: Michael Busch)

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch


 From New flexible query parser thread by Micheal Busch
 in my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is Apache committer and PMC member
 on the Tuscany project and getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was it to separate syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
   AND
  /   \
 A B
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. It is in fact a
 configurable chain of processors. Each processors can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
 Furthermore the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow to attach resource bundles. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in a few days. We now have a 100%
 compatible Lucene query parser, which means the syntax is identical and
 all query parser test cases pass on the new one too using a wrapper.
 Recent posts show that there is demand for query syntax improvements,
 e.g improved range query syntax or operator precedence. There are
 already different QP implementations in Lucene+contrib, however I think
 we did not keep them all up to date and in sync. This is not too
 surprising, because usually when fixes and changes are made to the main
 query parser, people don't make the corresponding changes in the contrib
 parsers. (I'm guilty here too)
 With this new architecture it will be much easier to maintain different
 query syntaxes, as the actual code for the first layer is not very much.
 All syntaxes would benefit from patches and improvements we make to the
 underlying layers, which will make supporting different syntaxes much
 more manageable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
