[jira] Issue Comment Edited: (SOLR-1499) SolrEntityProcessor - DIH EntityProcessor that queries an external Solr via SolrJ

2010-02-25 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838711#action_12838711
 ] 

Lance Norskog edited comment on SOLR-1499 at 2/26/10 5:09 AM:
--

Add error-handling. Correctly handles skip, continue and abort.
Add unit tests for error-handling.
Rename unit tests for more clarity.

Still has the flaw that all attributes are evaluated at the beginning. 
It is not thread-safe.

Includes one non-backwards-compatible change: the 'solr' attribute is now 'url' 
to maintain consistency with the rest of the DIH. 

  was (Author: lancenorskog):
Add error-handling. Correctly handles skip, continue and abort.
Add unit tests for error-handling.
Rename unit tests for more clarity.

Still has the flaw that all attributes are evaluated at the beginning. 
It is not thread-safe.
  
> SolrEntityProcessor - DIH EntityProcessor that queries an external Solr via 
> SolrJ
> -
>
> Key: SOLR-1499
> URL: https://issues.apache.org/jira/browse/SOLR-1499
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: Lance Norskog
>Assignee: Erik Hatcher
> Fix For: 1.5
>
> Attachments: SOLR-1499.patch, SOLR-1499.patch, SOLR-1499.patch, 
> SOLR-1499.patch, SOLR-1499.patch
>
>
> The SolrEntityProcessor queries an external Solr instance. The Solr documents 
> returned are unpacked and emitted as DIH fields.
> The SolrEntityProcessor uses the following attributes:
> * solr='http://localhost:8983/solr/sms'
> ** This gives the URL of the target Solr instance.
> *** Note: the connection to the target Solr uses the binary SolrJ format.
> * query='Jefferson&sort=id+asc'
> ** This gives the base query string used with Solr. It can include any 
> standard Solr request parameter. This attribute is processed under the 
> variable resolution rules and can be driven in an inner stage of the indexing 
> pipeline.
> * rows='10'
> ** This gives the number of rows to fetch per request.
> ** The SolrEntityProcessor always fetches every document that matches the 
> request.
> * fields='id,tag'
> ** This selects the fields to be returned from the Solr request.
> ** These must also be declared as <field> elements.
> ** As with all fields, template processors can be used to alter the contents 
> to be passed downwards.
> * timeout='30'
> ** This limits the query to 30 seconds. This can be used as a fail-safe to 
> prevent the indexing session from freezing up. By default the timeout is 5 
> minutes.
> Limitations:
> * Solr errors are not handled correctly.
> * Loop control constructs have not been tested.
> * Multi-valued returned fields have not been tested.
> The unit tests give examples of how to use it as the root entity and an inner 
> entity.
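
For illustration, a minimal sketch of how such an entity might be declared in 
data-config.xml, using the attribute values from the list above (the entity and 
field names are examples only, and per the latest patch the 'solr' attribute is 
now 'url'):

{code}
<dataConfig>
  <document>
    <!-- Illustrative entity definition; attribute values are examples only -->
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/sms"
            query="Jefferson&amp;sort=id+asc"
            rows="10"
            fields="id,tag"
            timeout="30">
      <field column="id" name="id"/>
      <field column="tag" name="tag"/>
    </entity>
  </document>
</dataConfig>
{code}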

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1499) SolrEntityProcessor - DIH EntityProcessor that queries an external Solr via SolrJ

2010-02-25 Thread Lance Norskog (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-1499:


Attachment: SOLR-1499.patch

Add error-handling. Correctly handles skip, continue and abort.
Add unit tests for error-handling.
Rename unit tests for more clarity.

Still has the flaw that all attributes are evaluated at the beginning. 
It is not thread-safe.

> SolrEntityProcessor - DIH EntityProcessor that queries an external Solr via 
> SolrJ
> -
>
> Key: SOLR-1499
> URL: https://issues.apache.org/jira/browse/SOLR-1499
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: Lance Norskog
>Assignee: Erik Hatcher
> Fix For: 1.5
>
> Attachments: SOLR-1499.patch, SOLR-1499.patch, SOLR-1499.patch, 
> SOLR-1499.patch, SOLR-1499.patch
>
>
> The SolrEntityProcessor queries an external Solr instance. The Solr documents 
> returned are unpacked and emitted as DIH fields.
> The SolrEntityProcessor uses the following attributes:
> * solr='http://localhost:8983/solr/sms'
> ** This gives the URL of the target Solr instance.
> *** Note: the connection to the target Solr uses the binary SolrJ format.
> * query='Jefferson&sort=id+asc'
> ** This gives the base query string used with Solr. It can include any 
> standard Solr request parameter. This attribute is processed under the 
> variable resolution rules and can be driven in an inner stage of the indexing 
> pipeline.
> * rows='10'
> ** This gives the number of rows to fetch per request.
> ** The SolrEntityProcessor always fetches every document that matches the 
> request.
> * fields='id,tag'
> ** This selects the fields to be returned from the Solr request.
> ** These must also be declared as <field> elements.
> ** As with all fields, template processors can be used to alter the contents 
> to be passed downwards.
> * timeout='30'
> ** This limits the query to 30 seconds. This can be used as a fail-safe to 
> prevent the indexing session from freezing up. By default the timeout is 5 
> minutes.
> Limitations:
> * Solr errors are not handled correctly.
> * Loop control constructs have not been tested.
> * Multi-valued returned fields have not been tested.
> The unit tests give examples of how to use it as the root entity and an inner 
> entity.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-25 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Backing a core up works, at least according to the test case... I will probably 
begin to test this patch in a staging environment next, where Zookeeper is run 
in its own process and a real HDFS cluster is used.

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
> SOLR-1724.patch
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.
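
A rough sketch of the listening-host side of this flow, using a plain ZooKeeper 
watch (the paths follow the layout above; the class and the handleCoresFile 
callback are illustrative, not part of the patch):

{code}
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class CoresFileWatcher implements Watcher {
  private final ZooKeeper zk;

  public CoresFileWatcher(String connectString) throws Exception {
    this.zk = new ZooKeeper(connectString, 10000, this); // 10s session timeout
  }

  public void watchProduction() throws Exception {
    // Re-registers the watch, so this fires again whenever a cores-N.txt appears
    List<String> children = zk.getChildren("/production", true);
    for (String child : children) {
      if (child.startsWith("cores-")) {
        byte[] data = zk.getData("/production/" + child, false, null);
        handleCoresFile(child, new String(data, "UTF-8"));
      }
    }
  }

  public void process(WatchedEvent event) {
    try {
      watchProduction();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  // Hypothetical: find this host's hostname,corename,instanceDir,coredownloadpath
  // entries, fetch the core, then write the host's core-host-N-actual.txt state
  private void handleCoresFile(String name, String contents) {
  }
}
{code}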

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-25 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Zipping from a Lucene directory works and has a test case.

A ReplicationHandler is added by default under a unique name; if one already 
exists, we still create our own, for the express purpose of locking an index 
commit point, zipping it, and then uploading it to, for example, HDFS. This part 
will likely be written next.
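
For the zip step, a minimal JDK-only sketch of the compression itself; it 
assumes the commit point's file list has already been pinned by a deletion 
policy so the files cannot be merged away mid-copy:

{code}
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class IndexZipper {
  /** Zips the named files from an index directory into a single archive. */
  public static void zipIndex(File indexDir, String[] fileNames, File zipFile)
      throws IOException {
    ZipOutputStream out = new ZipOutputStream(
        new BufferedOutputStream(new FileOutputStream(zipFile)));
    byte[] buf = new byte[8192];
    try {
      for (String name : fileNames) {
        out.putNextEntry(new ZipEntry(name));
        InputStream in = new BufferedInputStream(
            new FileInputStream(new File(indexDir, name)));
        try {
          int n;
          while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
          }
        } finally {
          in.close();
        }
        out.closeEntry();
      }
    } finally {
      out.close();
    }
  }
}
{code}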

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Some odd wiki stuff on the SolrRequestHandler page

2010-02-25 Thread Mark Miller

The SolrRequestHandler page is generating some odd wiki links:

Under List of Request Handlers Available we use:

<-title:CategorySolrRequestHandler)>>


But that's bringing up a non-relevant German page and what looks like a Japanese page:

  1. CategoryCategory
  2. DataImportHandler
  3. DisMaxRequestHandler
  4. HilfeZuMakros
  5. LukeRequestHandler
  6. MoreLikeThisHandler
  7. SearchHandler
  8. SpellCheckerRequestHandler

The content of each page is not correct for this set of links.

Just noting in case someone more familiar with this wants to take care 
of it. Offhand I'm not sure if something needs to be changed about those 
pages, or if we need to change how we are listing them.


--
- Mark

http://www.lucidimagination.com





[jira] Commented: (SOLR-1364) Distributed search return Solr shard header information (like qtime)

2010-02-25 Thread ian connor (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838557#action_12838557
 ] 

ian connor commented on SOLR-1364:
--

Good idea. Here is some sample output from a Solr stats page.

description: The standard Solr request handler
stats:
handlerStart : 1267037173182
requests : 3099817
errors : 2221
timeouts : 0
totalTime : 52082310
avgTimePerRequest : 16.801737
standardDeviation : 304.06168
avgRequestsPerSecond : 31.80935
10.0.16.181:8895/solr_numRequests : 12869
10.0.16.181:8884/solr_averageQTime : 9.651404
10.0.16.181:8896/solr_queryTime : 314763
10.0.16.181:8896/solr_elapsedTime : 681581
10.0.16.181:8885/solr_elapsedTime : 193555
10.0.16.181:8882/solr_elapsedTime : 329673
10.0.16.181:8898/solr_elapsedTime : 519454
10.0.16.181:8896/solr_queryQTime : 0
10.0.16.181:8897/solr_numRequests : 15344
10.0.16.181:8885/solr_queryTime : 62374
10.0.16.181:8891/solr_elapsedTime : 549124
10.0.16.181:8884/solr_elapsedTime : 367643
10.0.16.181:8898/solr_queryTime : 183239
10.0.16.181:8885/solr_averageQTime : 11.983478
10.0.16.181:/solr_averageQTime : 52.41645
10.0.16.181:8892/solr_averageQTime : 25.302937
10.0.16.181:8887/solr_queryQTime : 101
etc.

From this sample, we can see the 10.0.16.181:8892 average is only 25ms where 
10.0.16.181: is at 52ms (twice as much), so we might consider rebalancing 
the shards to give 10.0.16.181: less work.

Ideally, it would be good for the individual times to also go back on the 
request to the client, so that they could be tracked and any patterns could 
emerge (certain queries hurt certain shards, or the time might correspond to a 
replication event). If someone wants to take these times and put them on the 
request - that would be brilliant (I did not figure that part out yet).

> Distributed search return Solr shard header information (like qtime)
> 
>
> Key: SOLR-1364
> URL: https://issues.apache.org/jira/browse/SOLR-1364
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1364.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Distributed queries can expose the Solr shard query information
> such as qtime. The aggregate qtime can be broken up into the
> time required for each stage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1795) Subclassing QueryComponent for fetching results from a database

2010-02-25 Thread Dallan Quass (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dallan Quass updated SOLR-1795:
---

Priority: Minor  (was: Major)

> Subclassing QueryComponent for fetching results from a database
> ---
>
> Key: SOLR-1795
> URL: https://issues.apache.org/jira/browse/SOLR-1795
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 1.4
>Reporter: Dallan Quass
>Priority: Minor
>
> This is a request to change the access on a few fields from package to public.
> I've subclassed QueryComponent to allow me to fetch results from a database 
> (based upon the stored uniqueKey field) instead of from the shards. The only 
> stored field in solr is the uniqueKey field, and whatever fields I might need 
> for sorting.  To do this I've overridden QueryComponent.finishStage so that 
> after executing the query, SolrDocuments are created with the uniqueKey 
> field.  A later component populates the rest of the fields in the documents 
> by reading them from a database.
> {code}
> public void finishStage(ResponseBuilder rb) {
>   if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
>     // Create SolrDocuments from the ShardDocs
>     boolean returnScores = (rb.getFieldFlags() & SolrIndexSearcher.GET_SCORES) != 0;
>     for (ShardDoc sdoc : rb.resultIds.values()) {
>       SolrDocument doc = new SolrDocument();
>       doc.setField("id", sdoc.id);
>       if (returnScores && sdoc.score != null) {
>         doc.setField("score", sdoc.score);
>       }
>       rb._responseDocs.set(sdoc.positionInResponse, doc);
>     }
>   }
> }
> {code}
> Everything works fine, but ResponseBuilder variables: *resultIds* and 
> *_responseDocs*, and ShardDoc variables: *id*, *score*, and 
> *positionInResponse* currently all have package visibility.  I needed to 
> modify the core solr files to change their visibility to public so that I 
> could access them in the function above. Is there any chance that they could 
> be changed to public in a future version of Solr, or somehow make them 
> accessible outside the package?
> If people are interested, I could post the QueryComponent subclass and 
> database component that I wrote. But it gets a bit involved because the 
> QueryComponent subclass also handles parsing the query just at the main solr 
> server, and sending serialized parsed queries to the shards.  (Query parsing 
> in my environment is pretty cpu- and memory-intensive so I do it just at the 
> main server instead of the shards.)
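
One way the serialized-parsed-queries part could work (a sketch under the 
assumption of plain Java serialization, which Lucene 2.9's Query class supports; 
this is not necessarily what was done here):

{code}
import java.io.*;
import org.apache.lucene.search.Query;

public class QuerySerializer {
  /** Serializes a parsed query for shipping to a shard. */
  public static byte[] serialize(Query q) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(q);
    out.close();
    return bytes.toByteArray();
  }

  /** Reconstructs the query on the shard side. */
  public static Query deserialize(byte[] data)
      throws IOException, ClassNotFoundException {
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
    try {
      return (Query) in.readObject();
    } finally {
      in.close();
    }
  }
}
{code}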

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1795) Subclassing QueryComponent for fetching results from a database

2010-02-25 Thread Dallan Quass (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dallan Quass updated SOLR-1795:
---

Description: 
This is a request to change the access on a few fields from package to public.

I've subclassed QueryComponent to allow me to fetch results from a database 
(based upon the stored uniqueKey field) instead of from the shards. The only 
stored field in solr is the uniqueKey field, and whatever fields I might need 
for sorting.  To do this I've overridden QueryComponent.finishStage so that 
after executing the query, SolrDocuments are created with the uniqueKey field.  
A later component populates the rest of the fields in the documents by reading 
them from a database.

{code}
public void finishStage(ResponseBuilder rb) {
  if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
    // Create SolrDocuments from the ShardDocs
    boolean returnScores = (rb.getFieldFlags() & SolrIndexSearcher.GET_SCORES) != 0;
    for (ShardDoc sdoc : rb.resultIds.values()) {
      SolrDocument doc = new SolrDocument();
      doc.setField("id", sdoc.id);
      if (returnScores && sdoc.score != null) {
        doc.setField("score", sdoc.score);
      }
      rb._responseDocs.set(sdoc.positionInResponse, doc);
    }
  }
}
{code}

Everything works fine, but ResponseBuilder variables: *resultIds* and 
*_responseDocs*, and ShardDoc variables: *id*, *score*, and 
*positionInResponse* currently all have package visibility.  I needed to modify 
the core solr files to change their visibility to public so that I could access 
them in the function above. Is there any chance that they could be changed to 
public in a future version of Solr, or somehow make them accessible outside the 
package?

If people are interested, I could post the QueryComponent subclass and database 
component that I wrote. But it gets a bit involved because the QueryComponent 
subclass also handles parsing the query just at the main solr server, and 
sending serialized parsed queries to the shards.  (Query parsing in my 
environment is pretty cpu- and memory-intensive so I do it just at the main 
server instead of the shards.)


  was:
This is a request to change the access on a few fields from package to public.

I've subclassed QueryComponent to allow me to fetch results from a database 
(based upon the stored uniqueKey field) instead of from the shards. The only 
stored field in solr is the uniqueKey field, and whatever fields I might need 
for sorting.  To do this I've overridden QueryComponent.finishStage so that 
after executing the query, SolrDocuments are created with the uniqueKey field.  
A later component populates the rest of the fields in the documents by reading 
them from a database.

{code}
public void finishStage(ResponseBuilder rb) {
  if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
    // Create SolrDocuments from the ShardDocs
    boolean returnScores = (rb.getFieldFlags() & SolrIndexSearcher.GET_SCORES) != 0;
    for (ShardDoc sdoc : rb.resultIds.values()) {
      SolrDocument doc = new SolrDocument();
      doc.setField(UNIQUE_KEY_FIELDNAME, sdoc.id);
      if (returnScores && sdoc.score != null) {
        doc.setField("score", sdoc.score);
      }
      rb._responseDocs.set(sdoc.positionInResponse, doc);
    }
  }
}
{code}

Everything works fine, but ResponseBuilder variables: *resultIds* and 
*_responseDocs*, and ShardDoc variables: *id*, *score*, and 
*positionInResponse* currently all have package visibility.  I needed to modify 
the core solr files to change their visibility to public so that I could access 
them in the function above. Is there any chance that they could be changed to 
public in a future version of Solr, or somehow make them accessible outside the 
package?

If people are interested, I could post the QueryComponent subclass and database 
component that I wrote. But it gets a bit involved because the QueryComponent 
subclass also handles parsing the query just at the main solr server, and 
sending serialized parsed queries to the shards.  (Query parsing in my 
environment is pretty cpu- and memory-intensive so I do it just at the main 
server instead of the shards.)



> Subclassing QueryComponent for fetching results from a database
> ---
>
> Key: SOLR-1795
> URL: https://issues.apache.org/jira/browse/SOLR-1795
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 1.4
>Reporter: Dallan Quass
>
> This is a request to change the access on a few fields from package to public.
> I've subclassed QueryComponent to allow me to fetch results from a database 
> (based upon the stored uniqueKey field) instead of from the shards. The only 
> stored field in solr is the uniqueKey field ...

[jira] Created: (SOLR-1795) Subclassing QueryComponent for fetching results from a database

2010-02-25 Thread Dallan Quass (JIRA)
Subclassing QueryComponent for fetching results from a database
---

 Key: SOLR-1795
 URL: https://issues.apache.org/jira/browse/SOLR-1795
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Dallan Quass


This is a request to change the access on a few fields from package to public.

I've subclassed QueryComponent to allow me to fetch results from a database 
(based upon the stored uniqueKey field) instead of from the shards. The only 
stored field in solr is the uniqueKey field, and whatever fields I might need 
for sorting.  To do this I've overridden QueryComponent.finishStage so that 
after executing the query, SolrDocuments are created with the uniqueKey field.  
A later component populates the rest of the fields in the documents by reading 
them from a database.

{code}
public void finishStage(ResponseBuilder rb) {
  if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
    // Create SolrDocuments from the ShardDocs
    boolean returnScores = (rb.getFieldFlags() & SolrIndexSearcher.GET_SCORES) != 0;
    for (ShardDoc sdoc : rb.resultIds.values()) {
      SolrDocument doc = new SolrDocument();
      doc.setField(UNIQUE_KEY_FIELDNAME, sdoc.id);
      if (returnScores && sdoc.score != null) {
        doc.setField("score", sdoc.score);
      }
      rb._responseDocs.set(sdoc.positionInResponse, doc);
    }
  }
}
{code}

Everything works fine, but ResponseBuilder variables: *resultIds* and 
*_responseDocs*, and ShardDoc variables: *id*, *score*, and 
*positionInResponse* currently all have package visibility.  I needed to modify 
the core solr files to change their visibility to public so that I could access 
them in the function above. Is there any chance that they could be changed to 
public in a future version of Solr, or somehow make them accessible outside the 
package?

If people are interested, I could post the QueryComponent subclass and database 
component that I wrote. But it gets a bit involved because the QueryComponent 
subclass also handles parsing the query just at the main solr server, and 
sending serialized parsed queries to the shards.  (Query parsing in my 
environment is pretty cpu- and memory-intensive so I do it just at the main 
server instead of the shards.)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-1069) CSV document and field boosting support

2010-02-25 Thread Dallan Quass (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838526#action_12838526
 ] 

Dallan Quass edited comment on SOLR-1069 at 2/25/10 8:33 PM:
-

FWIW, I made a few changes to CSVRequestHandler.java, which mainly involve 
extracting CSVLoader into a separate public class and making a few 
variables/functions visible outside the package.  The attached files show the 
changes I made.  

Doing this allowed me to create a subclass of CSVLoader that does boosting:

{code}
public class BoostingCSVRequestHandler extends ContentStreamHandlerBase {
  protected ContentStreamLoader newLoader(SolrQueryRequest req,
                                          UpdateRequestProcessor processor) {
    return new BoostingCSVLoader(req, processor);
  }

  //////////////////// SolrInfoMBeans methods ////////////////////

  @Override
  public String getDescription() {
    return "boost CSV documents";
  }

  @Override
  public String getVersion() {
    return "";
  }

  @Override
  public String getSourceId() {
    return "";
  }

  @Override
  public String getSource() {
    return "";
  }
}

class BoostingCSVLoader extends CSVLoader {
  int boostFieldNum;

  BoostingCSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    super(req, processor);
  }

  // Returns a copy of the array with the element at pos removed.
  private String[] removeElement(String[] a, int pos) {
    String[] n = new String[a.length - 1];
    if (pos > 0) System.arraycopy(a, 0, n, 0, pos);
    if (pos < n.length) System.arraycopy(a, pos + 1, n, pos, n.length - pos);
    return n;
  }

  // Finds and strips the "boost" column before the standard field setup.
  @Override
  protected void prepareFields() {
    boostFieldNum = -1;
    for (int i = 0; i < fieldnames.length; i++) {
      if (fieldnames[i].equals("boost")) {
        boostFieldNum = i;
        break;
      }
    }
    if (boostFieldNum >= 0) {
      fieldnames = removeElement(fieldnames, boostFieldNum);
    }

    super.prepareFields();
  }

  // Applies the row's boost value as the document boost, then delegates.
  public void addDoc(int line, String[] vals) throws IOException {
    templateAdd.indexedId = null;
    SolrInputDocument doc = new SolrInputDocument();
    if (boostFieldNum >= 0) {
      float boost = Float.parseFloat(vals[boostFieldNum]);
      doc.setDocumentBoost(boost);
      vals = removeElement(vals, boostFieldNum);
    }

    doAdd(line, vals, doc, templateAdd);
  }
}
{code}

  was (Author: dallanq):
FWIW, I made a few changes to CSVRequestHandler.java, which mainly involve 
extracting CSVLoader into a separate public class and making a few 
variables/functions visible outside the package.  The attached files show the 
changes I made.  

Doing this allowed me to create a subclass of CSVLoader that does boosting:

public class BoostingCSVRequestHandler extends ContentStreamHandlerBase {
  protected ContentStreamLoader newLoader(SolrQueryRequest req,
                                          UpdateRequestProcessor processor) {
    return new BoostingCSVLoader(req, processor);
  }

  //////////////////// SolrInfoMBeans methods ////////////////////

  @Override
  public String getDescription() {
    return "boost CSV documents";
  }

  @Override
  public String getVersion() {
    return "";
  }

  @Override
  public String getSourceId() {
    return "";
  }

  @Override
  public String getSource() {
    return "";
  }
}

class BoostingCSVLoader extends CSVLoader {
  int boostFieldNum;

  BoostingCSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    super(req, processor);
  }

  private String[] removeElement(String[] a, int pos) {
    String[] n = new String[a.length - 1];
    if (pos > 0) System.arraycopy(a, 0, n, 0, pos);
    if (pos < n.length) System.arraycopy(a, pos + 1, n, pos, n.length - pos);
    return n;
  }

  @Override
  protected void prepareFields() {
    boostFieldNum = -1;
    for (int i = 0; i < fieldnames.length; i++) {
      if (fieldnames[i].equals("boost")) {
        boostFieldNum = i;
        break;
      }
    }
    if (boostFieldNum >= 0) {
      fieldnames = removeElement(fieldnames, boostFieldNum);
    }

    super.prepareFields();
  }

  public void addDoc(int line, String[] vals) throws IOException {
    templateAdd.indexedId = null;
    SolrInputDocument doc = new SolrInputDocument();
    if (boostFieldNum >= 0) {
      float boost = Float.parseFloat(vals[boostFieldNum]);
      doc.setDocumentBoost(boost);
      vals = removeElement(vals, boostFieldNum);
    }

    doAdd(line, vals, doc, templateAdd);
  }
}

  
> CSV document and field boosting support
> ---
>
> Key: SOLR-1069
> URL: https://issues.apache.org/jira/browse/SOLR-1069
> Project: Solr
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Priority: Minor
> Attachments: CSVLoader.java, CSVRequestHandler.java.diff

[jira] Updated: (SOLR-1069) CSV document and field boosting support

2010-02-25 Thread Dallan Quass (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dallan Quass updated SOLR-1069:
---

Attachment: CSVLoader.java
CSVRequestHandler.java.diff

FWIW, I made a few changes to CSVRequestHandler.java, which mainly involve 
extracting CSVLoader into a separate public class and making a few 
variables/functions visible outside the package.  The attached files show the 
changes I made.  

Doing this allowed me to create a subclass of CSVLoader that does boosting:

public class BoostingCSVRequestHandler extends ContentStreamHandlerBase {
  protected ContentStreamLoader newLoader(SolrQueryRequest req,
                                          UpdateRequestProcessor processor) {
    return new BoostingCSVLoader(req, processor);
  }

  //////////////////// SolrInfoMBeans methods ////////////////////

  @Override
  public String getDescription() {
    return "boost CSV documents";
  }

  @Override
  public String getVersion() {
    return "";
  }

  @Override
  public String getSourceId() {
    return "";
  }

  @Override
  public String getSource() {
    return "";
  }
}

class BoostingCSVLoader extends CSVLoader {
  int boostFieldNum;

  BoostingCSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    super(req, processor);
  }

  private String[] removeElement(String[] a, int pos) {
    String[] n = new String[a.length - 1];
    if (pos > 0) System.arraycopy(a, 0, n, 0, pos);
    if (pos < n.length) System.arraycopy(a, pos + 1, n, pos, n.length - pos);
    return n;
  }

  @Override
  protected void prepareFields() {
    boostFieldNum = -1;
    for (int i = 0; i < fieldnames.length; i++) {
      if (fieldnames[i].equals("boost")) {
        boostFieldNum = i;
        break;
      }
    }
    if (boostFieldNum >= 0) {
      fieldnames = removeElement(fieldnames, boostFieldNum);
    }

    super.prepareFields();
  }

  public void addDoc(int line, String[] vals) throws IOException {
    templateAdd.indexedId = null;
    SolrInputDocument doc = new SolrInputDocument();
    if (boostFieldNum >= 0) {
      float boost = Float.parseFloat(vals[boostFieldNum]);
      doc.setDocumentBoost(boost);
      vals = removeElement(vals, boostFieldNum);
    }

    doAdd(line, vals, doc, templateAdd);
  }
}


> CSV document and field boosting support
> ---
>
> Key: SOLR-1069
> URL: https://issues.apache.org/jira/browse/SOLR-1069
> Project: Solr
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Priority: Minor
> Attachments: CSVLoader.java, CSVRequestHandler.java.diff
>
>
> It would be good if the CSV loader could do document and field boosting.  
> I believe this could be handled via additional "special" columns that are 
> tacked on, such as "doc.boost" and <fieldname>.boost, which are then filled 
> in with boost values on a per-row basis.  Obviously, this approach would 
> prevent someone having an actual column named <fieldname>.boost, so maybe we 
> can make that configurable as well.
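
For illustration, an input file under that scheme might look like the following 
(hypothetical columns; "doc.boost" carries the per-document boost and 
"name.boost" a per-field boost):

{code}
id,name,doc.boost,name.boost
1,apple,1.0,1.0
2,banana,2.5,1.0
3,cherry,1.0,3.0
{code}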

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1364) Distributed search return Solr shard header information (like qtime)

2010-02-25 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838521#action_12838521
 ] 

Otis Gospodnetic commented on SOLR-1364:


Ian - please copy/paste an example of your own stats gathered with your 
changes, so it's easy for people to see what the output is and to evaluate it.

> Distributed search return Solr shard header information (like qtime)
> 
>
> Key: SOLR-1364
> URL: https://issues.apache.org/jira/browse/SOLR-1364
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1364.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Distributed queries can expose the Solr shard query information
> such as qtime. The aggregate qtime can be broken up into the
> time required for each stage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1788) Remove duplicate field in schema.xml

2010-02-25 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-1788.


Resolution: Won't Fix

Please email questions to the solr-user list.

> Remove duplicate field in schema.xml
> 
>
> Key: SOLR-1788
> URL: https://issues.apache.org/jira/browse/SOLR-1788
> Project: Solr
>  Issue Type: New Feature
>Reporter: Bill Bell
>
> Is there a way to remove duplicates in a multiValued field? For example, if I 
> add the following - is there a way to remove the duplicates? If not directly 
> in schema.xml, how about in DIH?
> 
> Full Bathrooms = 2
> Bedrooms = 2
> Bedrooms = 2
> Full Bathrooms = 2
> Property Address = Orange,92805
> Property Type = Apartments
> 
> This would be changed to:
> 
> Bedrooms = 2
> Full Bathrooms = 2
> Property Address = Orange,92805
> Property Type = Apartments
> 
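
As far as I can tell there is no schema.xml switch for this in 1.4, but a small 
custom UpdateRequestProcessor could drop duplicate values before indexing. A 
sketch against the Solr 1.4 API (class name illustrative, untested):

{code}
import java.io.IOException;
import java.util.Collection;
import java.util.LinkedHashSet;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class DedupeValuesProcessor extends UpdateRequestProcessor {
  public DedupeValuesProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.solrDoc; // public field in Solr 1.4
    for (String name : doc.getFieldNames()) {
      SolrInputField field = doc.getField(name);
      Collection<Object> values = field.getValues();
      if (values != null && values.size() > 1) {
        // LinkedHashSet keeps first-seen order while dropping duplicates
        Collection<Object> unique = new LinkedHashSet<Object>(values);
        if (unique.size() < values.size()) {
          field.setValue(unique, field.getBoost());
        }
      }
    }
    super.processAdd(cmd);
  }
}
{code}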

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1752) SolrJ fails with exception when passing document ADD and DELETEs in the same request using XML request writer (but not binary request writer)

2010-02-25 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838486#action_12838486
 ] 

Shalin Shekhar Mangar commented on SOLR-1752:
-

Jayson, Solr's update XML does not define a container tag, so we are constrained 
to only one of add/delete/commit/optimize at a time. The binary format, of 
course, does not have this problem. So unless we decide to add a root tag to the 
update XML, this exception will happen.

So I guess we have the following options:
# Disallow more than one type of operation for any request writer
# Document this behavior in the UpdateRequest javadocs.

I'd prefer #2 even though it is inconsistent.
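
To illustrate the problem: with both an add and a delete queued, the XML 
request writer ends up sending two root elements in a single request body, 
roughly like this (a sketch, not the literal output):

{code}
<add>
  <doc>
    <field name="id">id3</field>
    <field name="name">doc3</field>
    <field name="price">10</field>
  </doc>
</add>
<delete>
  <id>id001</id>
</delete>
{code}

A well-formed XML document allows only one root element, which is exactly the 
"Illegal to have multiple roots (start tag in epilog?)" error Woodstox raises.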

> SolrJ fails with exception when passing document ADD and DELETEs in the same 
> request using XML request writer (but not binary request writer)
> -
>
> Key: SOLR-1752
> URL: https://issues.apache.org/jira/browse/SOLR-1752
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java, update
>Affects Versions: 1.4
>Reporter: Jayson Minard
>Assignee: Shalin Shekhar Mangar
>Priority: Blocker
>
> Add this test to SolrExampleTests.java and it will fail when using the XML 
> Request Writer (now default), but not if you change the SolrExampleJettyTest 
> to use the BinaryRequestWriter.
> {code}
> public void testAddDeleteInSameRequest() throws Exception {
>   SolrServer server = getSolrServer();
>   SolrInputDocument doc3 = new SolrInputDocument();
>   doc3.addField( "id", "id3", 1.0f );
>   doc3.addField( "name", "doc3", 1.0f );
>   doc3.addField( "price", 10 );
>   UpdateRequest up = new UpdateRequest();
>   up.add( doc3 );
>   up.deleteById("id001");
>   up.setWaitFlush(false);
>   up.setWaitSearcher(false);
>   up.process( server );
> }
> {code}
> terminates with exception:
> {code}
> Feb 3, 2010 8:55:34 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Illegal to have multiple roots 
> (start tag in epilog?).
>  at [row,col {unknown-source}]: [1,125]
>   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:72)
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>   at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>   at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>   at org.mortbay.jetty.Server.handle(Server.java:285)
>   at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>   at 
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:723)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>   at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>   at 
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal to have multiple 
> roots (start tag in epilog?).
>  at [row,col {unknown-source}]: [1,125]
>   at 
> com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
>   at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
>   at 
> com.ctc.wstx.sr.BasicStreamReader.handleExtraRoot(BasicStreamReader.java:2155)
>   at 
> com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2070)
>   at 
> com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2647)
>   at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:90)
>   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>   ... 18 more
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1364) Distributed search return Solr shard header information (like qtime)

2010-02-25 Thread ian connor (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838473#action_12838473
 ] 

ian connor commented on SOLR-1364:
--

Hi Jason, this patch is not that good. It does expose where you can capture 
that information - but does not report it on each request. Instead, it adds it 
to the statistics page and calculates a running average, total and count per 
shard. It will at least help you see if you have a hot shard that is on average 
taking a long time.
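
A sketch of the kind of per-shard accumulator behind those stats entries (class 
and method names here are illustrative, not the patch's):

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Accumulates per-shard request counts and QTimes for a stats page. */
public class ShardTimeStats {
  static class Entry {
    final AtomicLong numRequests = new AtomicLong();
    final AtomicLong totalQTime = new AtomicLong();
  }

  private final ConcurrentHashMap<String, Entry> byShard =
      new ConcurrentHashMap<String, Entry>();

  /** Called once per shard response, e.g. record("10.0.16.181:8892/solr", 25). */
  public void record(String shard, long qtime) {
    Entry e = byShard.get(shard);
    if (e == null) {
      Entry fresh = new Entry();
      Entry prev = byShard.putIfAbsent(shard, fresh);
      e = (prev == null) ? fresh : prev;
    }
    e.numRequests.incrementAndGet();
    e.totalQTime.addAndGet(qtime);
  }

  /** Running average QTime for one shard; 0 if the shard has not been seen. */
  public double averageQTime(String shard) {
    Entry e = byShard.get(shard);
    if (e == null || e.numRequests.get() == 0) return 0.0;
    return (double) e.totalQTime.get() / e.numRequests.get();
  }
}
{code}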

> Distributed search return Solr shard header information (like qtime)
> 
>
> Key: SOLR-1364
> URL: https://issues.apache.org/jira/browse/SOLR-1364
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1364.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Distributed queries can expose the Solr shard query information
> such as qtime. The aggregate qtime can be broken up into the
> time required for each stage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1375) BloomFilter on a field

2010-02-25 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838446#action_12838446
 ] 

Otis Gospodnetic commented on SOLR-1375:


{quote}
When new segments are created, and commit is called, a new
bloom filter is generated from a given field (default:id) by
iterating over the term dictionary values. There's a bloom
filter file per segment, which is managed on each Solr shard.
When segments are merged away, their corresponding .blm files are
also removed. 
{quote}

Doesn't this hint at some of this stuff (haven't looked at the patch) really 
needing to live in Lucene index segment files merging land?


> BloomFilter on a field
> --
>
> Key: SOLR-1375
> URL: https://issues.apache.org/jira/browse/SOLR-1375
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, 
> SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read-only probabilistic set. It's useful
> for verifying that a key exists in a set, though it can return false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster, which (when using the term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files are
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to set up
> multiple Tomcat servers and manually execute tests.
> * We need more of the bloom filter options to be passable via
> solrconfig
> * I'll add more test cases
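
A rough sketch of the per-segment build step described above, using the Hadoop 
0.20 BloomFilter over the term dictionary of the id field (the vector size and 
hash count are placeholder values; the real ones would come from solrconfig):

{code}
import java.io.IOException;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class SegmentBloomBuilder {
  /** Builds a bloom filter from all terms of one field in a segment reader. */
  public static BloomFilter build(IndexReader segmentReader, String field)
      throws IOException {
    BloomFilter filter = new BloomFilter(1 << 20, 4, Hash.MURMUR_HASH);
    TermEnum terms = segmentReader.terms(new Term(field, ""));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) break; // past this field's terms
        filter.add(new Key(t.text().getBytes("UTF-8")));
      } while (terms.next());
    } finally {
      terms.close();
    }
    // The filter would then be written out as the segment's .blm file
    return filter;
  }
}
{code}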

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1794) Dataimport of CLOB fields fails when getCharacterStream() is defined in a superclass

2010-02-25 Thread Gunnar Gauslaa Bergem (JIRA)
Dataimport of CLOB fields fails when getCharacterStream() is defined in a 
superclass


 Key: SOLR-1794
 URL: https://issues.apache.org/jira/browse/SOLR-1794
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 1.4
 Environment: Oracle WebLogic 10.3.2
Reporter: Gunnar Gauslaa Bergem


When running Solr on the WebLogic application server 10.3.2, the dataimport of 
CLOB fields fails. Line 109 in FieldReaderDataSource.java illustrates the 
problem:

Method m = clob.getClass().getDeclaredMethod("getCharacterStream");

Since getDeclaredMethod is used instead of getMethod, the getCharacterStream() 
method will not be found if it is defined in a superclass of clob. This is 
exactly what happens in e.g. WebLogic 10.3.2, since the object returned is an 
instance of a dynamically created wrapper class called Clob_oracle_sql_CLOB. 
This class does not define getCharacterStream(), but it inherits it from 
another class that does. This problem will also occur in other places where 
getDeclaredMethod is used in conjunction with the CLOB or BLOB datatypes.
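
The fix is presumably a one-line change from getDeclaredMethod to getMethod, 
since getMethod also searches public methods inherited from superclasses (a 
sketch of the relevant call, not the actual patch):

{code}
import java.io.Reader;
import java.lang.reflect.Method;

public class ClobReflectionSketch {
  static Reader readCharStream(Object clob) throws Exception {
    // getDeclaredMethod only sees methods declared directly on the wrapper
    // class; getMethod also finds public methods inherited from superclasses.
    Method m = clob.getClass().getMethod("getCharacterStream");
    return (Reader) m.invoke(clob);
  }
}
{code}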

Stacktrace:

org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to get 
reader from clob Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.FieldReaderDataSource.readCharStream(FieldReaderDataSource.java:118)
at 
org.apache.solr.handler.dataimport.ClobTransformer.readFromClob(ClobTransformer.java:69)
at 
org.apache.solr.handler.dataimport.ClobTransformer.transformRow(ClobTransformer.java:61)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: java.lang.NoSuchMethodException: 
weblogic.jdbc.wrapper.Clob_oracle_sql_CLOB.getCharacterStream()
at java.lang.Class.getDeclaredMethod(Class.java:1937)
at 
org.apache.solr.handler.dataimport.FieldReaderDataSource.readCharStream(FieldReaderDataSource.java:109)
... 11 more


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.