Re: Solr & Java 1.6 ... was: Re: [jira] Commented: (SOLR-1873) Commit Solr Cloud to trunk

2010-04-15 Thread Jason Rutherglen
I'm planning on using Solr Cloud and have been waiting for the commit to
trunk, so let's do it (i.e., Java6).

On Wed, Apr 14, 2010 at 11:32 PM, Ryan McKinley  wrote:
> I'm fine with 1.6 as a min requirement...  but i imagine others have
> different opinions :)
>
>
> On Wed, Apr 14, 2010 at 2:53 PM, Yonik Seeley
>  wrote:
>> Yes, it requires that Solr in general is compiled with Java6.  We
>> should make our lives easier and make Java6 a Solr requirement.
>> Zookeeper requires Java6, and we also want Java6 for some of the
>> scripting capabilities.
>>
>> -Yonik
>> Apache Lucene Eurocon 2010
>> 18-21 May 2010 | Prague
>>
>>
>> On Wed, Apr 14, 2010 at 2:35 PM, Chris Hostetter
>>  wrote:
>>>
>>> I haven't been following the Cloud stuff very closely; can someone clarify
>>> what exactly the situation is w/Solr Cloud and Java 1.6?
>>>
>>> Will merging the cloud changes to trunk require that core pieces of Solr
>>> be compiled/run with Java 1.6 (ie: a change to our minimum operating
>>> requirements), or will it just require that people wanting cloud
>>> management features use a 1.6 JVM and include a new solr contrib and
>>> appropriate config options at run time (and this contrib is the only thing
>>> that needs to be compiled with 1.6)?
>>>
>>> As far as hudson and the build system goes ... there's certainly no reason
>>> we can't have more than one setup ... one build using JDK 1.5 (with the
>>> build.xml files detecting the JDK version and vocally not building the
>>> code that can't be compiled (either just the contrib, or all of solr)) and
>>> a separate build using JDK 1.6 that builds and tests everything.
>>>
>>> (having this setup in general would be handy if/when other lucene contribs
>>> start wanting to incorporate Java 1.6 features)
>>>
>>>
>>> : bq. As I wrap up the remaining work here, one issue looms: We are going
>>> : to need to move Hudson to Java 6 before this can be committed.
>>> :
>>> : In most respects, I think that would be a positive anyway.  Java6 is now
>>> : the primary production deployment platform for new projects (and it's
>>> : new projects that will be using new lucene and/or solr).  With respect
>>> : to keeping Lucene Java5 compatible, we can always run the tests with
>>> : Java5 before commits (that's what I did in the past when Lucene was on
>>> : Java1.4)
>>>
>>>
>>>
>>> -Hoss
>>>
>>>
>>
>


[jira] Commented: (SOLR-1375) BloomFilter on a field

2010-03-30 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851637#action_12851637
 ] 

Jason Rutherglen commented on SOLR-1375:


{quote}Doesn't this hint at some of this stuff (haven't looked at the patch) 
really needing to live in Lucene index segment files merging land?{quote}

Adding this to Lucene is outside the scope of what I require, and I don't 
have time for it unless it's going to be committed.

> BloomFilter on a field
> --
>
> Key: SOLR-1375
> URL: https://issues.apache.org/jira/browse/SOLR-1375
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, 
> SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read-only probabilistic set. It's useful
> for verifying that a key exists in a set, though it can return false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster, which (when using the term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name, I just realized.) 
> * When new segments are created and commit is called, a new
> bloom filter is generated from a given field (default: id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files are
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner rather than later, after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work; I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to set up
> multiple Tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig.
> * I'll add more test cases.
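
A minimal sketch of the per-segment filter build described above, assuming
Hadoop 0.20's BloomFilter API and Lucene's pre-4.0 TermEnum; the sizes and
field name here are illustrative, not what the patch actually uses:

{code}
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class BloomFilterBuilder {
  // Build a bloom filter over one field of a segment reader by
  // iterating its term dictionary.
  public static BloomFilter build(IndexReader reader, String field)
      throws Exception {
    // vector size and hash count are tuning knobs; values are illustrative
    BloomFilter filter = new BloomFilter(1 << 20, 4, Hash.MURMUR_HASH);
    TermEnum terms = reader.terms(new Term(field, ""));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) break; // walked past the field
        filter.add(new Key(t.text().getBytes("UTF-8")));
      } while (terms.next());
    } finally {
      terms.close();
    }
    return filter;
  }

  // Persist the filter as a per-segment .blm file.
  public static void save(BloomFilter filter, String blmPath) throws Exception {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(blmPath));
    try {
      filter.write(out); // BloomFilter is a Hadoop Writable
    } finally {
      out.close();
    }
  }
}
{code}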

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-03-02 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Fixed the unit tests that were failing due to the switch-over to using 
CoreContainer's initZooKeeper method.  ZkNodeCoresManager is instantiated in 
CoreContainer.  

There's the beginning of a UI in zkcores.jsp.

I think we still need a core move test.  I'm thinking of adding backing up a 
core as an action that may be performed in a new cores version file.  

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each cores-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc.
> and
> core-host-N-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new cores-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.
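
A rough sketch of the per-host listener loop described above, using the plain
ZooKeeper client API; the paths, file naming, and CSV parsing mirror the
description and are assumptions, not the patch's actual code:

{code}
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Watch /production for new cores-N.txt files, read the newest one,
// and act on the lines that name this host.
public class CoresWatcher implements Watcher {
  private final ZooKeeper zk;
  private final String hostname;

  public CoresWatcher(ZooKeeper zk, String hostname) {
    this.zk = zk;
    this.hostname = hostname;
    // arm the watch once at startup: zk.getChildren("/production", this)
  }

  public void process(WatchedEvent event) {
    try {
      // re-register the watch and pick the newest cores file
      List<String> children = zk.getChildren("/production", this);
      Collections.sort(children); // a real impl would compare numeric suffixes
      String newest = null;
      for (String child : children) {
        if (child.startsWith("cores-")) newest = child;
      }
      if (newest == null) return;
      byte[] data = zk.getData("/production/" + newest, false, null);
      for (String line : new String(data, "UTF-8").split("\n")) {
        // hostname,corename,instanceDir,coredownloadpath
        String[] parts = line.split(",");
        if (parts.length == 4 && parts[0].equals(hostname)) {
          // download parts[3] into parts[2], register the core, then
          // report the outcome to this host's core-host-N-actual.txt node
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
{code}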




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-03-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839937#action_12839937
 ] 

Jason Rutherglen commented on SOLR-1724:


I'm starting work on the cores file upload.  The cores file is in JSON format 
and can be assembled by an entirely different process (i.e. the core assignment 
creation is decoupled from core deployment).  

I need to figure out how Solr's HTTP file uploading works... There's 
probably an example somewhere.
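
Since gson-1.4.jar is attached to the issue, here is one plausible shape for a
cores-file entry; the field names simply mirror the CSV fields from the issue
description and are assumptions, not the patch's actual schema:

{code}
import com.google.gson.Gson;

public class CoreEntry {
  // hypothetical fields: hostname,corename,instanceDir,coredownloadpath
  String hostname;
  String corename;
  String instanceDir;
  String coredownloadpath;

  public static void main(String[] args) {
    CoreEntry e = new CoreEntry();
    e.hostname = "host1";
    e.corename = "core0";
    e.instanceDir = "/var/solr/core0";
    e.coredownloadpath = "hdfs://namenode/cores/core0.zip";

    Gson gson = new Gson();
    String json = gson.toJson(e);   // {"hostname":"host1","corename":"core0",...}
    CoreEntry back = gson.fromJson(json, CoreEntry.class); // round-trips cleanly
  }
}
{code}

Gson maps fields to JSON keys directly, which keeps the file human-editable by
whatever external process assembles the core assignments.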




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839520#action_12839520
 ] 

Jason Rutherglen commented on SOLR-1724:


Started on the nodes reporting their status to separate files that are 
ephemeral nodes; there's no sense in keeping them around if the node isn't up, 
and the status is legitimately ephemeral.  In this case, the status will be 
something like "Core download 45% (7 GB of 15 GB)".  
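
A minimal sketch of publishing such an ephemeral status node; the path and
message format are assumptions for illustration:

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class StatusReporter {
  // Publish a per-host status node that disappears when the session dies.
  public static void report(ZooKeeper zk, String host, String status)
      throws Exception {
    String path = "/production/status-" + host; // illustrative path
    byte[] data = status.getBytes("UTF-8");     // e.g. "Core download 45% (7 GB of 15 GB)"
    if (zk.exists(path, false) == null) {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } else {
      zk.setData(path, data, -1); // -1 = ignore the node version
    }
  }
}
{code}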




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-26 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838926#action_12838926
 ] 

Jason Rutherglen commented on SOLR-1724:


In thinking about this some more, in order for the functionality
provided in this issue to be more useful, there could be a web-based
UI to easily view the master cores table. There could
additionally be an easy way to upload a new cores version into
Zookeeper. I'm not sure if the uploading should be web based or
command line; I'm figuring web based, simply because this is
more in line with the rest of Solr. 

As a core is installed or is in the midst of some other process
(such as backing itself up), the node/NodeCoresManager can
report the ongoing status to Zookeeper. For large cores (e.g. 20
GB) it's important to see how they're doing, and if they're
taking too long, begin some remedial action. The UI can display
the statuses. 





[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-25 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Backing a core up works, at least according to the test case... I will probably 
begin to test this patch in a staging environment next, where Zookeeper is run 
in its own process and a real HDFS cluster is used.




[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-25 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Zipping from a Lucene directory works and has a test case.

A ReplicationHandler is added by default under a unique name; even if one exists 
already, we still create our own, for the express purpose of locking an index 
commit point, zipping it, and then uploading it to, for example, HDFS.  This part 
will likely be written next.
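
A sketch of the lock-then-zip step, assuming a SnapshotDeletionPolicy guards
the commit point (pre-4.0 Lucene APIs); the HDFS upload is left out:

{code}
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;

public class CommitZipper {
  public static void zipCommit(SnapshotDeletionPolicy policy, Directory dir,
                               String zipPath) throws Exception {
    IndexCommit commit = policy.snapshot(); // pin the commit so merges can't delete it
    try {
      ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(zipPath));
      try {
        for (String file : commit.getFileNames()) {
          zip.putNextEntry(new ZipEntry(file));
          IndexInput in = dir.openInput(file);
          try {
            byte[] buf = new byte[8192];
            long remaining = in.length();
            while (remaining > 0) {
              int len = (int) Math.min(buf.length, remaining);
              in.readBytes(buf, 0, len);
              zip.write(buf, 0, len);
              remaining -= len;
            }
          } finally {
            in.close();
          }
          zip.closeEntry();
        }
      } finally {
        zip.close();
      }
    } finally {
      policy.release(); // un-pin the commit point
    }
  }
}
{code}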




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837898#action_12837898
 ] 

Jason Rutherglen commented on SOLR-1724:


I'm not sure how we'll handle (or if we even need to) installing
a new core over an existing core of the same name, in other
words, core replacement. I think the instanceDir would need to be
different, which means we'll need to detect and fail on the case
of a new cores version (aka desired state) trying to install
itself into an existing core's instanceDir. Otherwise this
potential error case is costly in production. 
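
A sketch of the detect-and-fail check this implies, assuming Solr 1.4's
CoreContainer.getCores() and CoreDescriptor.getInstanceDir(); the failure
behavior is illustrative:

{code}
import java.io.File;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;

public class InstanceDirGuard {
  // Fail fast if a desired core would install into a directory
  // already owned by a different, existing core.
  public static void check(CoreContainer cores, String newCoreName,
                           String newInstanceDir) {
    File target = new File(newInstanceDir).getAbsoluteFile();
    for (SolrCore core : cores.getCores()) {
      File existing =
          new File(core.getCoreDescriptor().getInstanceDir()).getAbsoluteFile();
      if (existing.equals(target) && !core.getName().equals(newCoreName)) {
        throw new IllegalStateException("instanceDir " + newInstanceDir
            + " is already used by core " + core.getName());
      }
    }
  }
}
{code}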

It makes me wonder about the shard id in Solr Cloud and how that
can be used to uniquely identify an installed core, if a core of
a given name is not guaranteed to be the same across Solr
servers.




[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-23 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

I added a test case that simulates attempting to install a bad core.

Still need to get backing up a Solr core to HDFS working.




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-23 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837418#action_12837418
 ] 

Jason Rutherglen commented on SOLR-1724:


We need a test case that covers a partial install and verifies that any 
extraneous files are cleaned up afterwards.




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836898#action_12836898
 ] 

Jason Rutherglen commented on SOLR-1724:


Actually, I just realized the whole exercise of moving a core is pointless; 
it's exactly the same as replication, so this is a non-issue...

I'm going to work on backing up a core to HDFS...




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836896#action_12836896
 ] 

Jason Rutherglen commented on SOLR-1724:


I'm taking the approach of simply reusing SnapPuller and a replication handler 
for each core... This'll be faster to implement and more reliable for the first 
release (i.e. I won't run into wacky little bugs, because I'll be reusing code 
that's well tested).  




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836022#action_12836022
 ] 

Jason Rutherglen commented on SOLR-1724:


We need a URL type parameter to define whether a URL in a core entry points to 
a zip file or to a Solr server download point.  




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836018#action_12836018
 ] 

Jason Rutherglen commented on SOLR-1724:


Some further notes... I can reuse the replication code, but I am going to place 
the functionality into the core admin handler because it needs to work across 
cores and not have to be configured in each core's solrconfig.  

Also, we need to somehow support merging cores... Is that available yet?  It 
looks like merging indexes is only supported at the Directory level?
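
For reference, the Directory-level merge the comment refers to is Lucene's
addIndexesNoOptimize; how (or whether) this gets wired into the core admin
handler is the open question. A sketch against Lucene 2.9 APIs:

{code}
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexMerger {
  // Merge several source index directories into a fresh target index.
  public static void merge(File target, File[] sources) throws Exception {
    IndexWriter writer = new IndexWriter(FSDirectory.open(target),
        new StandardAnalyzer(Version.LUCENE_29), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    try {
      Directory[] dirs = new Directory[sources.length];
      for (int i = 0; i < sources.length; i++) {
        dirs[i] = FSDirectory.open(sources[i]);
      }
      writer.addIndexesNoOptimize(dirs); // Directory-level merge
    } finally {
      writer.close();
    }
  }
}
{code}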




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836013#action_12836013
 ] 

Jason Rutherglen commented on SOLR-1724:


I think the check on whether a conf file has been modified (to reload the 
core) can borrow from the replication handler and detect changes based on the 
checksums of the files... Though this somewhat complicates the storage of the 
checksum and the resultant JSON file.
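
One simple way to compute such a per-file checksum (MD5 here; the replication
handler's exact scheme may differ):

{code}
import java.io.FileInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;

public class ConfChecksum {
  // MD5 of a conf file's bytes; compare against the stored value
  // to decide whether the core needs a reload.
  public static String md5(String path) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("MD5");
    InputStream in = new FileInputStream(path);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        digest.update(buf, 0, n);
      }
    } finally {
      in.close();
    }
    return new BigInteger(1, digest.digest()).toString(16);
  }
}
{code}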




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835981#action_12835981
 ] 

Jason Rutherglen commented on SOLR-1724:


{quote}Will this http access also allow a cluster with
incrementally updated cores to replicate a core after a node
failure? {quote}

You're talking about moving an existing core into HDFS? That's a
great idea... I'll add it to the list!

Maybe for general "actions" on the system, there can be a ZK
directory acting as a queue that contains actions to be
performed by the cluster. When an action is completed, its
corresponding action file is deleted. 
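
A sketch of that queue using ZooKeeper sequential nodes; the paths are
illustrative, and the parent node is assumed to exist:

{code}
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ActionQueue {
  // Producers append actions as sequential nodes...
  public static void enqueue(ZooKeeper zk, byte[] action) throws Exception {
    zk.create("/production/actions/action-", action,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
  }

  // ...and a worker takes the oldest one, performs it, then deletes it.
  public static byte[] take(ZooKeeper zk) throws Exception {
    List<String> children = zk.getChildren("/production/actions", false);
    if (children.isEmpty()) return null;
    Collections.sort(children); // the sequence suffix orders the queue
    String oldest = "/production/actions/" + children.get(0);
    byte[] data = zk.getData(oldest, false, null);
    zk.delete(oldest, -1); // delete once the action is completed
    return data;
  }
}
{code}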




Can we move FileFetcher out of SnapPuller?

2010-02-19 Thread Jason Rutherglen
Can we move FileFetcher out of SnapPuller? This will assist with
reusing the replication handler for moving/copying cores.


[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835965#action_12835965
 ] 

Jason Rutherglen commented on SOLR-1724:


For the above core moving, utilizing the existing Java replication will 
probably be suitable.  However, in all cases we need to copy the contents of 
all files related to the core (meaning everything under conf and data).  How 
does one accomplish this?
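
Absent a ready-made Solr API for this, a plain recursive copy is one answer;
this sketch makes no attempt to lock in-flight index files, which replication
handles via commit points:

{code}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CoreCopier {
  // Recursively copy a core's conf/ and data/ trees to a destination.
  public static void copy(File src, File dst) throws IOException {
    if (src.isDirectory()) {
      if (!dst.exists() && !dst.mkdirs()) {
        throw new IOException("could not create " + dst);
      }
      String[] children = src.list();
      if (children == null) return;
      for (String child : children) {
        copy(new File(src, child), new File(dst, child));
      }
    } else {
      InputStream in = new FileInputStream(src);
      try {
        OutputStream out = new FileOutputStream(dst);
        try {
          byte[] buf = new byte[8192];
          int n;
          while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        } finally {
          out.close();
        }
      } finally {
        in.close();
      }
    }
  }
}
{code}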




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835955#action_12835955
 ] 

Jason Rutherglen commented on SOLR-1724:


Also needed is the ability to move an existing core to a
different Solr server. The core will need to be copied via
direct HTTP file access, from one Solr server to another. There
is no need to zip the core first. 

This feature is useful for core indexes that have been
incrementally built and then need to be archived (i.e. the index was not
constructed using Hadoop).




[jira] Issue Comment Edited: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835871#action_12835871
 ] 

Jason Rutherglen edited comment on SOLR-1724 at 2/19/10 6:36 PM:
-

Removing cores seems to work well, on to modified cores... I'm checkpointing 
progress in case things break, I can easily roll back.

  was (Author: jasonrutherglen):
Removing cores seems to work well, on to modified cores... I checkpointing 
progress in case things break, I can easily roll back.
  



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Removing cores seems to work well, on to modified cores... I'm checkpointing 
progress in case things break, I can easily roll back.




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835819#action_12835819
 ] 

Jason Rutherglen commented on SOLR-1724:


We need a test case for deleted and modified cores.




[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Added a way to keep a given number of host or cores files around in ZK; beyond 
that count, the oldest are deleted.
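
A sketch of that retention policy (keep the newest N, delete the rest); the
file naming and numeric-suffix parsing follow the issue description and are
assumptions:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ZkRetention {
  // Keep the newest maxKeep files with the given prefix (e.g. "cores-");
  // delete the older ones.
  public static void prune(ZooKeeper zk, String dir, String prefix, int maxKeep)
      throws Exception {
    List<Integer> versions = new ArrayList<Integer>();
    for (String child : zk.getChildren(dir, false)) {
      if (child.startsWith(prefix)) {
        // e.g. cores-12.txt -> 12
        String num = child.substring(prefix.length()).replace(".txt", "");
        versions.add(Integer.parseInt(num));
      }
    }
    Collections.sort(versions);
    while (versions.size() > maxKeep) {
      int oldest = versions.remove(0);
      zk.delete(dir + "/" + prefix + oldest + ".txt", -1);
    }
  }
}
{code}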




[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835513#action_12835513
 ] 

Jason Rutherglen commented on SOLR-1724:


I need to add the deletion policy before I can test this in a real environment; 
otherwise, bunches of useless files will pile up in ZK.

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Updated to HEAD

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835490#action_12835490
 ] 

Jason Rutherglen commented on SOLR-1724:


I need to figure out how to integrate this with the Solr Cloud distributed search 
stuff... Hmm... Maybe I'll start with the Solr Cloud test cases?

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

* No-commit

* NodeCoresManagerTest.testInstallCores works

* There are HDFS test cases using MiniDFSCluster



> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>    Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-16 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

No-commit

NodeCoresManager[Test] needs more work

A CoreController matchHosts unit test was added to CoreControllerTest

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834539#action_12834539
 ] 

Jason Rutherglen commented on SOLR-1724:


There's a wiki for this issue where the general specification is defined: 

http://wiki.apache.org/solr/DeploymentofSolrCoreswithZookeeper

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833108#action_12833108
 ] 

Jason Rutherglen commented on SOLR-1301:


There still seems to be a bug where the temporary index directory isn't deleted 
on job completion.

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> --
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-N directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.
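
For orientation, a job using this module would be wired up roughly as in the
following sketch (old 0.19-era mapred API; SolrOutputFormat comes from the
patch, everything else here is an assumption):

{code}
// Sketch only: a driver that sends reduce output through the patch's
// SolrOutputFormat. Mapper/reducer classes are elided.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
// SolrOutputFormat import omitted: it lives in the patch's contrib module.

public class CsvIndexJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CsvIndexJob.class);
    conf.setJobName("csv-to-solr-shards");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // part-N shard dirs land here
    conf.setOutputFormat(SolrOutputFormat.class);
    conf.setNumReduceTasks(4); // one EmbeddedSolrServer per reducer; 1 => a single shard
    JobClient.runJob(conf);
  }
}
{code}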

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1395) Integrate Katta

2010-02-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832587#action_12832587
 ] 

Jason Rutherglen commented on SOLR-1395:


shyjuThomas,

It'd be good to update this patch to the latest Katta... You're welcome to do 
so... For my project I only need what'll be in SOLR-1724... 

> Integrate Katta
> ---
>
> Key: SOLR-1395
> URL: https://issues.apache.org/jira/browse/SOLR-1395
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
> katta.node.properties, katta.zk.properties, log4j-1.2.13.jar, 
> solr-1395-1431-3.patch, solr-1395-1431-4.patch, solr-1395-1431.patch, 
> SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
> test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, zookeeper-3.2.1.jar
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We'll integrate Katta into Solr so that:
> * Distributed search uses Hadoop RPC
> * Shards/SolrCores are distributed and managed
> * Failover is Zookeeper based
> * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1761) Command line Solr check softwares

2010-02-08 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1761:
---

Attachment: SOLR-1761.patch

Here's a cleaned-up, committable version.

> Command line Solr check softwares
> -
>
> Key: SOLR-1761
> URL: https://issues.apache.org/jira/browse/SOLR-1761
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>    Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: SOLR-1761.patch, SOLR-1761.patch
>
>
> I'm in need of a command-line tool that Nagios and the like can execute to 
> verify that a Solr server is working... Basically it'll be a jar with apps 
> that return error codes if a given criterion isn't met.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1761) Command line Solr check softwares

2010-02-08 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1761:
---

Attachment: SOLR-1761.patch

No-commit

Here are a couple of apps that:

1) Check the query time
2) Check the last replication time

They exit with error code 1 on failure and 0 on success.
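
As a sketch of what the query-time check can look like (hypothetical class;
CommonsHttpSolrServer is the stock SolrJ 1.4 client, but the threshold handling
here is assumed):

{code}
// Exits 0 if a ping-style query returns within the threshold, 1 otherwise,
// which is what a Nagios plugin wrapper needs. Names are illustrative.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CheckQueryTime {
  // args: solrUrl maxMillis   e.g. http://localhost:8983/solr 500
  public static void main(String[] args) {
    try {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer(args[0]);
      long max = Long.parseLong(args[1]);
      QueryResponse rsp = server.query(new SolrQuery("*:*"));
      if (rsp.getQTime() <= max) {
        System.exit(0); // OK
      }
    } catch (Exception e) {
      // fall through to the failure exit code
    }
    System.exit(1); // failure: slow, error, or unreachable
  }
}
{code}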

> Command line Solr check softwares
> -
>
> Key: SOLR-1761
> URL: https://issues.apache.org/jira/browse/SOLR-1761
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>    Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: SOLR-1761.patch
>
>
> I'm in need of a command-line tool that Nagios and the like can execute to 
> verify that a Solr server is working... Basically it'll be a jar with apps 
> that return error codes if a given criterion isn't met.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Real-time deletes

2010-02-08 Thread Jason Rutherglen
Hello there dude...

I started on this, http://issues.apache.org/jira/browse/SOLR-1606

However, since then things have changed, so it may not work... You're
welcome to continue on it...

Cheers,

Jason

On Tue, Feb 9, 2010 at 3:20 PM, Kaktu Chakarabati  wrote:
> Hey Guys,
> haven't heard back from anyone - would really appreciate any response
> whatsoever (even an 'extremely not feasible right now'), just so I know
> whether to try and pursue this direction or abandon it...
>
> Thanks,
> -Chak
>
> On Fri, Feb 5, 2010 at 11:41 AM, KaktuChakarabati wrote:
>
>>
>> Hey,
>> some time ago I asked around and found out that Lucene has more or less
>> built-in support for propagating deletes to the active index without a
>> lengthy commit (I don't remember the exact semantics, but I believe it
>> involves using an IndexReader reopen() method or so).
>> I wanted to check back and find out whether Solr now makes use of this in
>> any way - otherwise, is anyone working on such a feature - and otherwise,
>> if I'd like to pick up the glove on this, what would be a correct way,
>> architecture-wise, to go about it? Implement it as a separate UpdateHandler
>> / flag?
>>
>> Thanks,
>> -Chak
>> --
>> View this message in context:
>> http://old.nabble.com/Real-time-deletes-tp27472975p27472975.html
>> Sent from the Solr - Dev mailing list archive at Nabble.com.
>>
>>
>


[jira] Created: (SOLR-1761) Command line Solr check softwares

2010-02-06 Thread Jason Rutherglen (JIRA)
Command line Solr check softwares
-

 Key: SOLR-1761
 URL: https://issues.apache.org/jira/browse/SOLR-1761
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5


I'm in need of a command-line tool that Nagios and the like can execute to 
verify that a Solr server is working... Basically it'll be a jar with apps that 
return error codes if a given criterion isn't met.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829200#action_12829200
 ] 

Jason Rutherglen commented on SOLR-1301:


In production the latest patch does not leave temporary files behind... though 
we previously had failed tasks, so perhaps there's still a bug; we won't know 
until we run out of disk space again.

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SolrRecordWriter.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1301:
---

Attachment: SOLR-1301.patch

I added the following to the SRW.close method's finally clause:

{code}
// delete the local temp index directory even if the close work above failed
FileUtils.forceDelete(new File(temp.toString()));
{code}
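
The point of putting it in the finally clause is that the temp directory goes
away even when packaging the shard fails; schematically (a sketch of the shape
of close, not the patch's exact code):

{code}
// Method bodies and packZipFile's arguments are elided/assumed here.
public void close(Reporter reporter) throws IOException {
  try {
    solr.commit();   // flush the EmbeddedSolrServer
    solr.optimize();
    packZipFile();   // package the finished shard
  } catch (Exception e) {
    throw new IOException(e.toString());
  } finally {
    FileUtils.forceDelete(new File(temp.toString())); // always clean up
  }
}
{code}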

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SolrRecordWriter.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828368#action_12828368
 ] 

Jason Rutherglen commented on SOLR-1301:


I'm testing deleting the temp dir in SRW.close finally...

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SolrRecordWriter.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828172#action_12828172
 ] 

Jason Rutherglen commented on SOLR-1301:


There's a bug caused by the latest change:
{quote}
java.io.IOException: java.lang.IllegalArgumentException: Wrong FS: hdfs://mi-prod-app01.ec2.biz360.com:9000/user/hadoop/solr/_attempt_201001212110_2841_r_01_0.1.index-a, expected: file:///
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:371)
at com.biz360.mi.index.hadoop.HadoopIndexer$ArticleReducer.reduce(HadoopIndexer.java:147)
at com.biz360.mi.index.hadoop.HadoopIndexer$ArticleReducer.reduce(HadoopIndexer.java:103)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://mi-prod-app01.ec2.biz360.com:9000/user/hadoop/solr/_attempt_201001212110_2841_r_01_0.1.index-a, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:305)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.solr.hadoop.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:459)
at org.apache.solr.hadoop.SolrRecordWriter.packZipFile(SolrRecordWriter.java:390)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:362)
... 5 more
{quote}
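
This "Wrong FS" error is what Hadoop raises when an hdfs:// Path is handed to
the local filesystem. The generic idiom for avoiding it (a sketch of the
standard fix, not necessarily what this patch ended up doing) is to resolve the
FileSystem from the Path itself:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsResolve {
  static FileStatus statusOf(Path path, Configuration conf) throws IOException {
    // Resolves hdfs://, file://, etc. so the path never hits RawLocalFileSystem.
    FileSystem fs = path.getFileSystem(conf);
    return fs.getFileStatus(path);
  }
}
{code}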

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SolrRecordWriter.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1301) Solr + Hadoop

2010-01-31 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1301:
---

Attachment: SOLR-1301.patch

This update includes Kevin's recommended path change.

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SolrRecordWriter.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-28 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Here's an update: we're onto the actual Solr node portion of the code, and some 
tests around that. I'm focusing on downloading cores out of HDFS because 
that's my use case.
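
A minimal sketch of that download step under assumed names (copyToLocalFile is
the stock Hadoop call; unpacking and core registration are elided):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoreDownloader {
  /** Copy a packed core out of HDFS into a local instance directory. */
  public long download(String coreDownloadPath, String instanceDir) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path(coreDownloadPath);   // e.g. hdfs://nn:9000/cores/core0.zip
    FileSystem fs = src.getFileSystem(conf); // resolves hdfs://, file://, hftp://
    fs.copyToLocalFile(src, new Path(instanceDir, src.getName()));
    return fs.getFileStatus(src).getLen();   // reported back as the size column
  }
}
{code}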

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
> SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-28 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: gson-1.4.jar
hadoop-0.20.2-dev-test.jar
hadoop-0.20.2-dev-core.jar

Hadoop and Gson dependencies

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
> hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804773#action_12804773
 ] 

Jason Rutherglen commented on SOLR-1724:


For some reason ZkTestServer doesn't need to be shut down any longer?

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804760#action_12804760
 ] 

Jason Rutherglen commented on SOLR-1724:


The ZK port changed in ZkTestServer

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804750#action_12804750
 ] 

Jason Rutherglen commented on SOLR-1724:


I did an svn update, though I'm now seeing the following error:

java.util.concurrent.TimeoutException: Could not connect to ZooKeeper within 5000 ms
at org.apache.solr.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:131)
at org.apache.solr.cloud.SolrZkClient.<init>(SolrZkClient.java:106)
at org.apache.solr.cloud.SolrZkClient.<init>(SolrZkClient.java:72)
at org.apache.solr.cloud.CoreControllerTest.testCores(CoreControllerTest.java:48)

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804655#action_12804655
 ] 

Jason Rutherglen commented on SOLR-1724:


Need a command-line tool that dumps the state of the
existing cluster from ZK out to a JSON file for a particular
version.

For my setup I'll have a program that looks at this cluster
state file and generates an input file to be written to ZK,
which essentially instructs the Solr nodes to match the new
cluster state. This lets me easily write my own
functionality that operates on the cluster without having to
deploy new software into Solr.
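
A minimal sketch of the dump tool (hypothetical names; Gson is already attached
to this issue as a dependency):

{code}
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DumpClusterState {
  // args: zkHost version outFile   e.g. localhost:2181 1 cluster-1.json
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper(args[0], 10000, new Watcher() {
      public void process(WatchedEvent event) {} // no-op watcher
    });
    byte[] data = zk.getData("/production/cores-" + args[1] + ".txt", false, null);
    List<String[]> rows = new ArrayList<String[]>();
    for (String line : new String(data, "UTF-8").split("\n")) {
      rows.add(line.split(",")); // hostname,corename,instanceDir,coredownloadpath
    }
    FileWriter out = new FileWriter(args[2]);
    out.write(new Gson().toJson(rows));
    out.close();
    zk.close();
  }
}
{code}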

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804590#action_12804590
 ] 

Jason Rutherglen commented on SOLR-1724:


{quote}If you know you're not going to store file data at nodes
that have children (the only way that downloading to a real file
system makes sense), you could just call getChildren - if there
are children, it's a dir, otherwise it's a file. That doesn't work for
empty dirs, but you could also just do getData, and if it
returns null, treat it as a dir, else treat it as a file.{quote}

Thanks Mark... 
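
A sketch of Mark's suggestion (hypothetical helper, no error handling): treat
znodes with null data as directories and recurse.

{code}
import java.io.File;
import java.io.FileOutputStream;
import org.apache.zookeeper.ZooKeeper;

public class ZkTreeDownloader {
  private final ZooKeeper zk;

  public ZkTreeDownloader(ZooKeeper zk) { this.zk = zk; }

  public void download(String zkPath, File dest) throws Exception {
    byte[] data = zk.getData(zkPath, false, null);
    if (data == null) {                 // null data => treat as a directory
      dest.mkdirs();
      for (String child : zk.getChildren(zkPath, false)) {
        download(zkPath + "/" + child, new File(dest, child));
      }
    } else {                            // non-null data => treat as a file
      FileOutputStream out = new FileOutputStream(dest);
      out.write(data);
      out.close();
    }
  }
}
{code}

As the quote notes, the null-data convention breaks for empty directories, so
whatever writes the tree has to guarantee every directory node is created with
null data.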

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803943#action_12803943
 ] 

Jason Rutherglen commented on SOLR-1724:


Do we have some code that recursively downloads a tree of files from ZK? The 
challenge is that I don't see a way to find out whether a given path represents 
a directory or not.

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: commons-lang-2.4.jar

commons-lang-2.4.jar is required

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: commons-lang-2.4.jar, SOLR-1724.patch
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1724:
---

Attachment: SOLR-1724.patch

Here's the first cut... I agree, I'm not really into ephemeral
ZK nodes for Solr hosts/nodes. The reason is contact with ZK is
highly superficial and can be intermittent. I'm mostly concerned
with ensuring the core operations succeed on a given server. If
a server goes down, there needs to be more than ZK to prove it,
and if it goes down completely, I'll simply reallocate its
cores to another server using the core management mechanism
provided in this patch. 

The issue is still being worked on, specifically the Solr server
portion that downloads the cores from some location, or performs
operations. The file format will move to JSON.

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
> Attachments: SOLR-1724.patch
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802526#action_12802526
 ] 

Jason Rutherglen commented on SOLR-1301:


I started on the Solr wiki page for this guy...

http://wiki.apache.org/solr/HadoopIndexing



> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains  a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> --
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-N directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801244#action_12801244
 ] 

Jason Rutherglen commented on SOLR-1724:


This'll be a patch on the cloud branch to reuse what's started. I don't see any 
core management code in there yet, so this looks complementary.

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801216#action_12801216
 ] 

Jason Rutherglen commented on SOLR-1724:


Ted,

Thanks for the Katta link. 

This patch will likely de-emphasize the distributed search part,
which is where the ephemeral node is used (i.e. a given server
lists its current state). I basically want to take care of this
one little deployment aspect of cores, improving on the wacky
hackedy system I'm running today. Then IF it works, I'll
look at the distributed search part, hopefully in a totally
separate patch.



> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801215#action_12801215
 ] 

Jason Rutherglen commented on SOLR-1724:


Note to self: I need a way to upload an empty core/confdir from the command 
line, basically into ZK, then reference that core from ZK (I think this'll 
work?).  I'd rather not rely on a separate HTTP server or something... The size 
of a jarred-up Solr conf dir shouldn't be too much for ZK?
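As a ballpark sanity check, ZK's default znode size limit (jute.maxbuffer) is about 1MB, so a zipped conf dir should normally fit in a single znode. A bare-bones command-line uploader could look like this (a sketch only; the ensemble address and target znode path are made up):

{code}
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfUploader {
  public static void main(String[] args) throws Exception {
    // args[0] = zip of the conf dir, args[1] = target znode, e.g. /configs/core1
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    FileInputStream in = new FileInputStream(args[0]);
    try {
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
    } finally {
      in.close();
    }
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        // no-op; only a single synchronous create is done here
      }
    });
    // The whole zip goes into one znode; the default jute.maxbuffer (~1MB)
    // caps how large it can be.
    zk.create(args[1], buf.toByteArray(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    zk.close();
  }
}
{code}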

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800994#action_12800994
 ] 

Jason Rutherglen commented on SOLR-1724:


Additionally, upon successful completion of a core-version deployment to a set 
of nodes, a customizable deletion-policy-like mechanism will, by default, 
clean up the old cores on the system.

> Real Basic Core Management with Zookeeper
> -
>
> Key: SOLR-1724
> URL: https://issues.apache.org/jira/browse/SOLR-1724
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
> Fix For: 1.5
>
>
> Though we're implementing cloud, I need something real soon I can
> play with and deploy. So this'll be a patch that only deploys
> new cores, and that's about it. The arch is real simple:
> On Zookeeper there'll be a directory that contains files that
> represent the state of the cores of a given set of servers which
> will look like the following:
> /production/cores-1.txt
> /production/cores-2.txt
> /production/core-host-1-actual.txt (ephemeral node per host)
> Where each core-N.txt file contains:
> hostname,corename,instanceDir,coredownloadpath
> coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
> etc
> and
> core-host-actual.txt contains:
> hostname,corename,instanceDir,size
> Every time a new core-N.txt file is added, the listening host
> finds its entry in the list and begins the process of trying to
> match the entries. Upon completion, it updates its
> /core-host-1-actual.txt file to its completed state or logs an error.
> When all host actual files are written (without errors), then a
> new core-1-actual.txt file is written which can be picked up by
> another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-15 Thread Jason Rutherglen (JIRA)
Real Basic Core Management with Zookeeper
-

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5


Though we're implementing cloud, I need something real soon I can
play with and deploy. So this'll be a patch that only deploys
new cores, and that's about it. The arch is real simple:

On Zookeeper there'll be a directory that contains files that
represent the state of the cores of a given set of servers which
will look like the following:

/production/cores-1.txt
/production/cores-2.txt
/production/core-host-1-actual.txt (ephemeral node per host)

Where each core-N.txt file contains:

hostname,corename,instanceDir,coredownloadpath

coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
etc

and

core-host-actual.txt contains:

hostname,corename,instanceDir,size

Every time a new core-N.txt file is added, the listening host
finds its entry in the list and begins the process of trying to
match the entries. Upon completion, it updates its
/core-host-1-actual.txt file to its completed state or logs an error.

When all host actual files are written (without errors), then a
new core-1-actual.txt file is written which can be picked up by
another process that can create a new core proxy.
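A minimal sketch of parsing one such line (illustrative names only, not part of any patch here):

{code}
/** One entry of a cores-N.txt file: hostname,corename,instanceDir,coredownloadpath */
public class CoreEntry {
  public final String hostname;
  public final String coreName;
  public final String instanceDir;
  public final String downloadPath;

  private CoreEntry(String hostname, String coreName,
                    String instanceDir, String downloadPath) {
    this.hostname = hostname;
    this.coreName = coreName;
    this.instanceDir = instanceDir;
    this.downloadPath = downloadPath;
  }

  /** Splits on the first three commas only, in case the URL itself contains one. */
  public static CoreEntry parse(String line) {
    String[] parts = line.split(",", 4);
    if (parts.length != 4) {
      throw new IllegalArgumentException("bad cores-N.txt line: " + line);
    }
    return new CoreEntry(parts[0].trim(), parts[1].trim(),
                         parts[2].trim(), parts[3].trim());
  }
}
{code}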

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Solr Cloud wiki and branch notes

2010-01-15 Thread Jason Rutherglen
> This is really about doing not-so-much in the very near term,
> while thinking ahead to the longer term.

Let's have a page dedicated to release 1.0 of cloud? I feel
uncomfortable editing the existing wiki because I don't know
what the plans are for the first release.

I need to revisit Katta as my short term plans include using
Zookeeper (not for failover) but simply for deploying
shards/cores to servers, and nothing else. I can use the core
admin interface to bring them online, update them etc. Or I'll
just implement something and make a patch to Solr... Thinking
out loud:

/anyname/shardlist-v1.txt
/anyname/shardlist-v2.txt

where shardlist-v1.txt contains:
corename,coredownloadpath,instanceDir

Where coredownloadpath can be any URL including hftp, hdfs, ftp, http, https.

Where the system automagically uninstalls cores that should no
longer exist on a given server. Cores with the same name
deployed to the same server would use the reload command,
otherwise the create command.

Where there's a ZK listener on the /anyname directory for new
files that are greater than the last known installed
shardlist.txt.

Alternatively, an even simpler design would be uploading a
solr.xml file per server, something like:
/anyname/solr-prod01.solr.xml

Which a directory listener on each server parses and makes the
necessary changes (without restarting Tomcat).

On the search side in this system, I'd need to wait for the
cores to complete their install, then swap in a new core on the
search proxy that represents the new version of the corelist,
then the old cores could go away. This isn't very different than
the segmentinfos system used in Lucene IMO.
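The listener part could be little more than a ZK child watch. A rough sketch (names made up; note that ZK watches fire once and must be re-registered, hence the call back into start()):

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ShardListWatcher implements Watcher {
  private final ZooKeeper zk;
  private String lastInstalled = "";  // e.g. "shardlist-v1.txt"

  public ShardListWatcher(ZooKeeper zk) {
    this.zk = zk;
  }

  public void start() throws Exception {
    // Registers this object as the child watch on /anyname.
    List<String> files = zk.getChildren("/anyname", this);
    if (files.isEmpty()) {
      return;
    }
    // Lexicographic max; fine as long as version numbers are zero-padded.
    String newest = Collections.max(files);
    if (newest.compareTo(lastInstalled) > 0) {
      // ... download and install the new shardlist here ...
      lastInstalled = newest;
    }
  }

  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeChildrenChanged) {
      try {
        start();  // re-check and re-register the one-shot watch
      } catch (Exception e) {
        // log and retry
      }
    }
  }
}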

On Fri, Jan 15, 2010 at 1:53 PM, Yonik Seeley  wrote:
> On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen
>  wrote:
>> The page is huge, which signals to me maybe we're trying to do
>> too much
>
> This is really about doing not-so-much in the very near term, while
> thinking ahead to the longer term.
>
>> Revamping distributed search could be in a different branch
>> (this includes partial results)
>
> That could just be a separate patch - it's scope is not that broad (I
> think there may already be a JIRA issue open for it).
>
>> Having a single solrconfig and schema for each core/shard in a
>> collection won't work for me. I need to define each core
>> externally, and I don't want Solr-Cloud to manage this, how will
>> this scenario work?
>
> We do plan on each core being able to have its own schema (so one
> could try out a version of a schema and gradually migrate the
> cluster).
>
> It could also be possible to define a schema as "local" (i.e. use the
> one on the local file system)
>
>> A host is about the same as node, I don't see the difference, or
>> enough of one
>
> A host is the hardware. It will have limited disk, limited CPU, etc.
> At some point we will want to model this... multiple nodes could be
> launched on one box.  We're not doing anything with it now, and won't
> in the near future.
>
>> Cluster resizing and rebalancing can and should be built
>> externally and hopefully after an initial release that does the
>> basics well
>
> The initial release will certainly not be doing any resizing or rebalancing.
> We should allow this to be done externally.  In the future, we
> shouldn't require that this be done externally though (i.e. we should
> somehow allow the cluster to grow w/o people having to write code).
>
>> Collection is a group of cores?
>
> A collection of documents - the complete search index.  It has a
> single schema, etc.
>
> -Yonik
> http://www.lucidimagination.com
>


Solr Cloud wiki and branch notes

2010-01-15 Thread Jason Rutherglen
Here's some rough notes after running the unit tests, reviewing
some of the code (though not understanding it), and reviewing
the wiki page http://wiki.apache.org/solr/SolrCloud


We need a protocol in the URL; otherwise it's inflexible

I'm overwhelmed with all the ?? question areas of the document.

The page is huge, which signals to me maybe we're trying to do
too much

Revamping distributed search could be in a different branch
(this includes partial results)

Having a single solrconfig and schema for each core/shard in a
collection won't work for me. I need to define each core
externally, and I don't want Solr-Cloud to manage this, how will
this scenario work?

A host is about the same as node, I don't see the difference, or
enough of one

Cluster resizing and rebalancing can and should be built
externally and hopefully after an initial release that does the
basics well

Collection is a group of cores?

I like the model -> reality system. However, how does the
versioning work? We need to know what the conversion progress
is. How will the queuing of in-progress alterations work? (This
seems hard; I'd rather focus on this and make it work well than
mess with other things like load balancing in the first release;
i.e. if this doesn't work well, Solr-Cloud isn't production
ready for me.)

Shard Identification, this falls under too ambitious right now
IMO

I think we need a wiki page of just the basics of core/shard
management, implement that, then build all the rest of the features on top...
Otherwise this thing feels like it's going to be a nightmare to
test and deploy in production.


Re: [jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Jason Rutherglen
Copying files a la HDFS is trivial because it's sequential;
Lucene merging isn't, so scaling merging over 20 machines vs 4 Solr
masters has clear advantages... That, and on-demand expandability:
being able to reindex 2 terabytes of data in half a day vs weeks
or more with 4 Solr masters is compelling.

On Fri, Jan 15, 2010 at 12:09 PM, Grant Ingersoll  wrote:
> I can see why that is a win over the existing, but I still don't get why it 
> wouldn't be faster just to index to a suite of Solr master indexers and save 
> all this file slogging around.  But, I guess that is a separate patch all 
> together.
>
>
>
> On Jan 15, 2010, at 2:35 PM, Jason Rutherglen wrote:
>
>> Zipping cores/shards is in the latest patch...
>>
>> On Fri, Jan 15, 2010 at 11:22 AM, Andrzej Bialecki  wrote:
>>> On 2010-01-15 20:13, Ted Dunning wrote:
>>>>
>>>> This can also be a big performance win.  Jason Venner reports significant
>>>> index and cluster start time improvements by indexing to local disk,
>>>> zipping
>>>> and then uploading the resulting zip file.  Hadoop has significant file
>>>> open
>>>> overhead so moving one zip file wins big over many index component files.
>>>> There is a secondary bandwidth win as well.
>>>
>>> Indeed, this one should be easy to add to this patch. Unless Jason & Jason
>>> already cooked a patch for this? ;)
>>>
>>>>
>>>> On Fri, Jan 15, 2010 at 8:34 AM, Andrzej Bialecki
>>>> (JIRA)wrote:
>>>>
>>>>>
>>>>> HDFS doesn't support enough POSIX to support writing Lucene indexes
>>>>> directly to HDFS - for this reason indexes are always created on local
>>>>> storage of each node, and then after closing they are copied to HDFS.
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
>
>


Re: [jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Jason Rutherglen
Zipping cores/shards is in the latest patch...

On Fri, Jan 15, 2010 at 11:22 AM, Andrzej Bialecki  wrote:
> On 2010-01-15 20:13, Ted Dunning wrote:
>>
>> This can also be a big performance win.  Jason Venner reports significant
>> index and cluster start time improvements by indexing to local disk,
>> zipping
>> and then uploading the resulting zip file.  Hadoop has significant file
>> open
>> overhead so moving one zip file wins big over many index component files.
>> There is a secondary bandwidth win as well.
>
> Indeed, this one should be easy to add to this patch. Unless Jason & Jason
> already cooked a patch for this? ;)
>
>>
>> On Fri, Jan 15, 2010 at 8:34 AM, Andrzej Bialecki
>> (JIRA)wrote:
>>
>>>
>>> HDFS doesn't support enough POSIX to support writing Lucene indexes
>>> directly to HDFS - for this reason indexes are always created on local
>>> storage of each node, and then after closing they are copied to HDFS.
>
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800802#action_12800802
 ] 

Jason Rutherglen commented on SOLR-1301:


bq. Hadoop streaming the output of the reduce tasks to the Solr
indexing servers. 

Yes, this is what we've implemented, it's just normal Solr HTTP
based indexing, right? It works well to a limited degree, and
for the particular implementation details, there are reasons why
this can be less than ideal. The balanced, distributed
shards/cores system works far better and enables us to use less
hardware (but I'm not going into all the details here). 
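(By "normal Solr HTTP based indexing" I just mean plain SolrJ from the reduce side; a minimal sketch, with made-up host names:)

{code}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushToSolr {
  public static void main(String[] args) throws Exception {
    // Inside a reducer this would push each record to a Solr master over HTTP.
    SolrServer solr = new CommonsHttpSolrServer("http://indexer1:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("text", "hello");
    solr.add(doc);
    // Commit once at the end of the task, not per document.
    solr.commit();
  }
}
{code}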

One issue I can mention is the switchover to a new set of
incremental servers (which happens when the old servers fill
up); I'm looking to automate this, and will likely focus on it
and the core management in the cloud branch. 

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains  a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> --
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-N directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800775#action_12800775
 ] 

Jason Rutherglen commented on SOLR-1301:


{quote}What I meant was the Hadoop job could simply know what
the set of master indexers are and send the documents directly
to them{quote}

One can use Hadoop for this purpose; we have implemented the
system this way for the incremental indexes; however, it
doesn't require a separate patch or contrib module. The problem
with the Hadoop streaming model is that it doesn't scale well if,
for example, we need to reindex using the CJKAnalyzer, or using
Basis' analyzer, etc. We use SOLR-1301 for reindexing loads of
data as fast as possible by parallelizing the indexing. There
are lots of little things I'd like to add to the functionality;
implementing ZK-based core management takes higher priority,
though, as I spend a lot of time doing this manually today.

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains  a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> --
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-N directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800756#action_12800756
 ] 

Jason Rutherglen commented on SOLR-1301:


Andrzej's model works great in production. We have both 1)
master -> slave for incremental updates, and 2) index in Hadoop
with this patch; we then deploy each new core/shard in a
balanced fashion to many servers. They're two separate
modalities. The ZK stuff (as it's modeled today) isn't useful
here, because I want the schema I indexed with as a part of the
zip file stored in HDFS (or S3, or wherever). 

Any sort of ZK thingy is good for managing the core/shards
across many servers; however, Katta does this already (so we're
either reinventing the same thing, not necessarily a bad thing
if we also have a clear path for incremental indexing, as
discussed above). Ultimately, the Solr server can be viewed as
simply a container for cores, and the cloud + ZK branch as a
manager of cores/shards. Anything more ambitious will probably
be overkill, and this is what I believe Ted has been trying to get at.

> Solr + Hadoop
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Andrzej Bialecki 
> Fix For: 1.5
>
> Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains  a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> --
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-N directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: SolrCloud logical shards

2010-01-15 Thread Jason Rutherglen
> The point I was trying to make is that I believe that if you start changing 
> terminologies now people will be very confused

So shard -> remote core... Slice -> core group.  Though semantically
they're synonyms.  In any case, I need to spend some time looking at
the cloud branch, and less time jibber-jabberin' about it.

On Fri, Jan 15, 2010 at 1:24 AM, Uri Boness  wrote:
>>
>> Can you elaborate on what you mean, isn't a core a single index
>> too? It seems like shard was used to represent a remote index
>> (perhaps?).
>
> Yes, a core is a single index and a shard is a conceptual idea which at the
> moment concretely refers to a remote core (but not a specific one as the
> same shard can be represented by multiple core replicas). The point I was
> trying to make is that I believe that if you start changing terminologies
> now people will be very confused. And I thought of sticking to Yonik's
> suggestion of a "slice" just to prevent this confusion. On the other hand
> one can argue that the terminology as it is today is already confusing...
> and if you really want to get it right and be aligned with the "rest of the
> world" (if there is such a thing... from what I've seen so far sharding is
> used differently in different contexts), then perhaps a "good" timing for
> making such terminology changes is with a major release (Solr 2.0?) as with
> such release people tend to be more open for new/changed concepts.
>
> Cheers,
> Uri
>
> Jason Rutherglen wrote:
>>
>> Uri,
>>
>>
>>>
>>> "core" to represent a single index and "shard" to be
>>> represented by a single core
>>>
>>
>> Can you elaborate on what you mean, isn't a core a single index
>> too? It seems like shard was used to represent a remote index
>> (perhaps?). Though here I'd prefer "remote core", because to the
>> uninitiated Solr outsider it's immediately obvious (i.e. they
>> need only know what a core is, in the Solr glossary or term
>> dictionary).
>>
>> In Google vernacular, which is where the name shard came from, a
>> "shard" is basically a local sub-index
>> http://research.google.com/archive/googlecluster.html where
>> there would be many "shards" per server. However that's a
>> digression at this point.
>>
>> I personally prefer relatively straightforward names, that are
>> self-evident, rather than inventing new language for fairly
>> simple concepts. Slice, even though it comes from our buddy
>> Yonik, probably doesn't make any immediate sense to external
>> users when compared with the word shard. Of course software
>> projects have a tendency to create their own words to somewhat
>> mystify users into believing in some sort of magic occurring
>> underneath. If that's what we're after, it's cool, I mean that
>> makes sense. And I don't mean to be derogatory here however this
>> is an open source project created in part to educate users on
>> search and be made as easily accessible as possible, to the
>> greatest number of users possible. I think Doug did a great job
>> of this when Lucene started with amazingly succinct code for
>> fairly complex concepts (eg, anti-mystification of search).
>>
>> Jason
>>
>> On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness  wrote:
>>
>>>
>>> Although Jason has some valid points here, I'm with Yonik here. I do
>>> believe
>>> that we've gotten used to the terms "core" to represent a single index
>>> and
>>> "shard" to be represented by a single core. A "node" seems to indicate a
>>> machine or a JVM. Changing any of these (informal perhaps) definitions
>>> will
>>> only cause confusion. That's why I think a "slice" is a good solution
>>> now...
>>> first it's a new term to a new view of the index (logical shard AFAIK
>>> don't
>>> really exists yet) so people won't need to get used to it, but it's also
>>> descriptive and intuitive. I do like Jason's idea about having a protocol
>>> attached to the URL's.
>>>
>>> Cheers,
>>> Uri
>>>
>>> Jason Rutherglen wrote:
>>>
>>>>>
>>>>> But I've kind of gotten used to thinking of shards as the
>>>>> actual physical queryable things...
>>>>>
>>>>>
>>>>
>>>> I think a mistake was made referring to Solr cores 

Re: SolrCloud logical shards

2010-01-14 Thread Jason Rutherglen
Uri,

> "core" to represent a single index and "shard" to be
> represented by a single core

Can you elaborate on what you mean, isn't a core a single index
too? It seems like shard was used to represent a remote index
(perhaps?). Though here I'd prefer "remote core", because to the
uninitiated Solr outsider it's immediately obvious (i.e. they
need only know what a core is, in the Solr glossary or term
dictionary).

In Google vernacular, which is where the name shard came from, a
"shard" is basically a local sub-index
http://research.google.com/archive/googlecluster.html where
there would be many "shards" per server. However that's a
digression at this point.

I personally prefer relatively straightforward names, that are
self-evident, rather than inventing new language for fairly
simple concepts. Slice, even though it comes from our buddy
Yonik, probably doesn't make any immediate sense to external
users when compared with the word shard. Of course software
projects have a tendency to create their own words to somewhat
mystify users into believing in some sort of magic occurring
underneath. If that's what we're after, it's cool, I mean that
makes sense. And I don't mean to be derogatory here however this
is an open source project created in part to educate users on
search and be made as easily accessible as possible, to the
greatest number of users possible. I think Doug did a great job
of this when Lucene started with amazingly succinct code for
fairly complex concepts (eg, anti-mystification of search).

Jason

On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness  wrote:
> Although Jason has some valid points here, I'm with Yonik here. I do believe
> that we've gotten used to the terms "core" to represent a single index and
> "shard" to be represented by a single core. A "node" seems to indicate a
> machine or a JVM. Changing any of these (informal perhaps) definitions will
> only cause confusion. That's why I think a "slice" is a good solution now...
> first it's a new term to a new view of the index (logical shard AFAIK don't
> really exists yet) so people won't need to get used to it, but it's also
> descriptive and intuitive. I do like Jason's idea about having a protocol
> attached to the URL's.
>
> Cheers,
> Uri
>
> Jason Rutherglen wrote:
>>>
>>> But I've kind of gotten used to thinking of shards as the
>>> actual physical queryable things...
>>>
>>
>> I think a mistake was made referring to Solr cores as shards.
>> It's the same thing with 2 different names. Slices adds yet
>> another name which seems to imply the same thing yet again. I'd
>> rather see disambiguation here, and call them cores (partially
>> because that's what's in the code and on the wiki), and cores
>> only. It's a Solr specific term, it's going to be confused with
>> microprocessor cores, but at least there's only one name, which
>> as search people, we know creates fewer posting lists :).
>>
>> Logical groupings of cores can occur, which can be aptly named
>> core groups. This way I can submit a query to a core group, and
>> it's reasonable to assume I'm hitting N cores. Further, cores
>> could point to a logical or physical entity via a URL. (As a
>> side note, I've always found it odd that the shards param to
>> RequestHandler lacks the protocol, what if I want to use HTTPS
>> for example?).
>>
>> So there could be http://host/solr/core1 (physical),
>> core://megacorename (logical),
>> coregroup://supergreatcoregroupname (a group of cores) in the
>> "shards" parameter (whose name can perhaps be changed for
>> clarity in a future release). Then people can mix and match and
>> we won't have many different XML elements floating around. We'd
>> have a simple list of URLs that are transposed into a real
>> physical network request.
>>
>>
>> On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
>>  wrote:
>>
>>>
>>> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
>>>  wrote:
>>>
>>>>
>>>> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
>>>>  wrote:
>>>>
>>>>>
>>>>> I'm actually starting to lean toward "slice" instead of "logical
>>>>> shard".
>>>>>
>>>
>>> Alternate terminology could be "index" for the actual physical Lucene
>>> index (and also enough of the URL that unambiguously identifies it),
>>> and then "shard" could be the logical entity.
>>>
>>> But I've kind of gotten used to thinking of shards as the actual
>>> physical queryable things...
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>
>>
>


Re: SolrCloud logical shards

2010-01-14 Thread Jason Rutherglen
> But I've kind of gotten used to thinking of shards as the
> actual physical queryable things...

I think a mistake was made referring to Solr cores as shards.
It's the same thing with 2 different names. Slices adds yet
another name which seems to imply the same thing yet again. I'd
rather see disambiguation here, and call them cores (partially
because that's what's in the code and on the wiki), and cores
only. It's a Solr specific term, it's going to be confused with
microprocessor cores, but at least there's only one name, which
as search people, we know creates fewer posting lists :).

Logical groupings of cores can occur, which can be aptly named
core groups. This way I can submit a query to a core group, and
it's reasonable to assume I'm hitting N cores. Further, cores
could point to a logical or physical entity via a URL. (As a
side note, I've always found it odd that the shards param to
RequestHandler lacks the protocol; what if I want to use HTTPS
for example?).

So there could be http://host/solr/core1 (physical),
core://megacorename (logical),
coregroup://supergreatcoregroupname (a group of cores) in the
"shards" parameter (whose name can perhaps be changed for
clarity in a future release). Then people can mix and match and
we won't have many different XML elements floating around. We'd
have a simple list of URLs that are transposed into a real
physical network request.
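As a straw man, resolving a shards entry could just dispatch on the scheme; the lookups below are purely illustrative (no such registry exists today):

import java.util.Collections;
import java.util.List;

public class ShardRefResolver {

  /** Expands one "shards" parameter entry into the physical URLs it denotes. */
  public List<String> resolve(String ref) {
    if (ref.startsWith("coregroup://")) {
      // Hypothetical: expand a group name to every core registered under it.
      return lookupGroup(ref.substring("coregroup://".length()));
    }
    if (ref.startsWith("core://")) {
      // Hypothetical: map a logical core name to its current physical home.
      return Collections.singletonList(lookupCore(ref.substring("core://".length())));
    }
    // http:// or https:// entries are already physical locations.
    return Collections.singletonList(ref);
  }

  private List<String> lookupGroup(String group) {
    return Collections.emptyList();  // e.g. resolved from ZK
  }

  private String lookupCore(String core) {
    return null;  // e.g. resolved from ZK
  }
}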


On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
 wrote:
> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
>  wrote:
>> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
>>  wrote:
>>> I'm actually starting to lean toward "slice" instead of "logical shard".
>
> Alternate terminology could be "index" for the actual physical Lucene
> index (and also enough of the URL that unambiguously identifies it),
> and then "shard" could be the logical entity.
>
> But I've kind of gotten used to thinking of shards as the actual
> physical queryable things...
>
> -Yonik
> http://www.lucidimagination.com
>


[jira] Commented: (SOLR-1720) replication configuration bug with multiple replicateAfter values

2010-01-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799843#action_12799843
 ] 

Jason Rutherglen commented on SOLR-1720:


For consistency, maybe we should support comma-delimited lists?  I edit the 
shards a lot (comma delimited), which could use different elements as well, so 
by rote I just used commas for this, because it seemed like a Solr standard... 

Thanks for clarifying!

> replication configuration bug with multiple replicateAfter values
> -
>
> Key: SOLR-1720
> URL: https://issues.apache.org/jira/browse/SOLR-1720
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Yonik Seeley
> Fix For: 1.5
>
>
> Jason reported problems with Multiple replicateAfter values - it worked after 
> changing to just "commit"
> http://www.lucidimagination.com/search/document/e4c9ba46dc03b031/replication_problem

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797898#action_12797898
 ] 

Jason Rutherglen commented on SOLR-1709:


Tim,

Thanks for the patch...

bq. as I'm having a bit of trouble with svn (don't shoot me, but my environment 
is a Redmond-based os company).

TortoiseSVN works well on Windows, even for creating patches.  Have you tried 
it?  



> Distributed Date Faceting
> -
>
> Key: SOLR-1709
> URL: https://issues.apache.org/jira/browse/SOLR-1709
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 1.4
>Reporter: Peter Sturge
>Priority: Minor
>
> This patch is for adding support for date facets when using distributed 
> searches.
> Date faceting across multiple machines exposes some time-based issues that 
> anyone interested in this behaviour should be aware of:
> Any time and/or time-zone differences are not accounted for in the patch 
> (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
> 'instant-in-time', unless all shards are time-synced to the exact same time).
> The implementation uses the first encountered shard's facet_dates as the 
> basis for subsequent shards' data to be merged in.
> This means that if subsequent shards' facet_dates are skewed in relation to 
> the first by >1 'gap', these 'earlier' or 'later' facets will not be merged 
> in.
> There are several reasons for this:
>   * Performance: It's faster to check facet_date lists against a single map's 
> data, rather than against each other, particularly if there are many shards
>   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
> time range larger than that which was requested
> (e.g. a request for one hour's worth of facets could bring back 2, 3 
> or more hours of data)
> This could be dealt with if timezone and skew information was added, and 
> the dates were normalized.
> One possibility for adding such support is to [optionally] add 'timezone' and 
> 'now' parameters to the 'facet_dates' map. This would tell requesters what 
> time and TZ the remote server thinks it is, and so multiple shards' time data 
> can be normalized.
> The patch affects 2 files in the Solr core:
>   org.apache.solr.handler.component.FacetComponent.java
>   org.apache.solr.handler.component.ResponseBuilder.java
> The main changes are in FacetComponent - ResponseBuilder is just to hold the 
> completed SimpleOrderedMap until the finishStage.
> One possible enhancement is to perhaps make this an optional parameter, but 
> really, if facet.date parameters are specified, it is assumed they are 
> desired.
> Comments & suggestions welcome.
> As a favour to ask, if anyone could take my 2 source files and create a PATCH 
> file from it, it would be greatly appreciated, as I'm having a bit of trouble 
> with svn (don't shoot me, but my environment is a Redmond-based os company).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



solr-dev@lucene.apache.org

2009-12-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793474#action_12793474
 ] 

Jason Rutherglen commented on SOLR-1665:


Plus one, visibility into the components would be good.  This'll work for 
distributed processes (i.e. time taken on each node per component)?

> Add &debugTimings param so that timings for components can be retrieved 
> without having to do explains(), as in &debugQuery
> --
>
> Key: SOLR-1665
> URL: https://issues.apache.org/jira/browse/SOLR-1665
> Project: Solr
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.5
>
>
> As the title says, it would be great if we could just get back component 
> timings w/o having to do the full boat of explains and other stuff.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)

2009-12-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793358#action_12793358
 ] 

Jason Rutherglen commented on SOLR-1277:


{quote}Zookeeper gives us the layout of the cluster. It doesn't
seem like we need (yet) fast failure detection from zookeeper -
other nodes can do this synchronously themselves (and would need
to anyway) on things like connection failures. App-level
timeouts should not mark the node as failed since we don't know
how long the request was supposed to take.{quote}

Google Chubby, when used in conjunction with search, sets a high
timeout of 60 seconds, I believe?

Fast failover is difficult, so it'll be best to enable fast
re-requesting to adjacent slave servers on request failure. 

Mahadev has some good advice about how we can separate the logic
into different znodes. Going further, I think we'll want to allow
cores to register themselves, then listen to a separate
directory as to what state each should be in. We'll need to
ensure the architecture allows for defining multiple tiers (like a pyramid).

At http://wiki.apache.org/solr/ZooKeeperIntegration is a node a
core or a server/corecontainer?

To move ahead we'll really need to define and settle on the
directory and file structure. I believe the requirement of
grouping cores so that one may issue a search against a group
name instead of individual shard names, will be useful. The
ability to move cores to different nodes will be necessary, as
is the ability to replicate cores (i.e. have multiple copies
available on different servers). 

Today I deploy lots of cores from HDFS across quite a few
servers containing 1.6 billion documents representing at least
2.4 TB of data. I mention this because a lot can potentially go
wrong in this type of setup (i.e. servers going down, corrupted
data, intermittent network, etc.). I generate a file that contains
all the information as to which core should go to which Solr
server using size based balancing. Ideally I'd be able to
generate a new file, perhaps for load balancing the cores across
new Solr servers or to define that hot cores should be
replicated, and the Solr cluster would move the cores to the
defined servers automatically. This doesn't include the separate
set of servers system that handles incremental updates (i.e.
master -> slave). 

There's a bit of trepidation in moving forward on this because
we don't want to engineer ourselves into a hole; however, if we
need to change the structure of the znodes in the future, we'll
need a healthy versioning plan such that one may upgrade a
cluster while maintaining backwards compatibility on a live
system. Let's think of a basic plan for this. 

In conclusion, let's iterate on the directory structure via the
wiki or this issue?
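For instance, a purely illustrative straw-man layout (names made up, to be torn apart):

{noformat}
/solr_cluster/nodes/host1:8983     (per-host registration)
/solr_cluster/desired/corename     (target state: host, instanceDir, downloadpath, version)
/solr_cluster/actual/corename      (state each host reports after installing)
/solr_cluster/groups/groupname     (core names addressable as one search target)
{noformat}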

{quote}A search node can have very large caches tied to readers
that all drop at once on commit, and can require a much larger
heap to accommodate these caches. I think thats a more common
scenario that creates these longer pauses.{quote}

The large cache issue should be fixable with the various NRT
changes in SOLR-1606. They're collectively not much different from
the search and sort per segment changes made to Lucene 2.9. 

> Implement a Solr specific naming service (using Zookeeper)
> --
>
> Key: SOLR-1277
> URL: https://issues.apache.org/jira/browse/SOLR-1277
> Project: Solr
>      Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.5
>
> Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and start from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)

2009-12-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791528#action_12791528
 ] 

Jason Rutherglen commented on SOLR-1277:


bq. as two types of failures, possibly

A failure is a failure, and whether it's the GC or something
else, it's really the same thing. Sounds like we're defining the
expectation of how the client handles a failure?

I think we'll need to define groups of shards (maybe this is
already in the spec), and allow a configurable failure setting
per group. For example, group "live" would be allowed to return
partial results because the user always wants results returned
quickly. Group "archive" would always return complete results
(if a node is down it can be configured to retry the request N
times until it succeeds under a given max timeout). 

Also a request could be addressed to a group of shards, which
would allow one set of replicated Zookeeper servers for N Solr
clusters (instead of a Zookeeper server per Solr cluster).  

How are we addressing a failed connection to a slave server, and
instead of failing the request, re-making the request to an
adjacent slave?
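
Here's a rough sketch of what I mean by re-requesting to an
adjacent slave (the replica list and the query call are
hypothetical stand-ins, not existing APIs):

{code:java}
import java.util.List;

public class AdjacentSlaveRetry {
  public static String queryWithFailover(List<String> replicaUrls, String q)
      throws Exception {
    Exception last = null;
    for (String url : replicaUrls) {
      try {
        return executeQuery(url, q);  // hypothetical HTTP search call
      } catch (Exception e) {
        last = e;  // connection failed: move on to the next slave
      }
    }
    throw new Exception("all replicas failed", last);
  }

  private static String executeQuery(String url, String q) throws Exception {
    throw new UnsupportedOperationException("stub for illustration");
  }
}
{code}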

> Implement a Solr specific naming service (using Zookeeper)
> --
>
> Key: SOLR-1277
> URL: https://issues.apache.org/jira/browse/SOLR-1277
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.5
>
> Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and start from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1506) Search multiple cores using MultiReader

2009-12-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789600#action_12789600
 ] 

Jason Rutherglen commented on SOLR-1506:


There's a different bug here: because CoreContainer loads the
cores sequentially and MultiCoreReaderFactory looks for all the
cores, not all the cores are searchable when the proxy core
isn't last, and if the proxy is first, an exception is thrown. 

The workaround is to place the proxy core last; however, that's
not possible when using the core admin HTTP API. Hmm... Not sure
what the best workaround is.

> Search multiple cores using MultiReader
> ---
>
> Key: SOLR-1506
> URL: https://issues.apache.org/jira/browse/SOLR-1506
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>    Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Fix For: 1.5
>
> Attachments: SOLR-1506.patch, SOLR-1506.patch, SOLR-1506.patch
>
>
> I need to search over multiple cores, and SOLR-1477 is more
> complicated than expected, so here we'll create a MultiReader
> over the cores to allow searching on them.
> Maybe in the future we can add parallel searching however
> SOLR-1477, if it gets completed, provides that out of the box.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1606) Integrate Near Realtime

2009-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787800#action_12787800
 ] 

Jason Rutherglen commented on SOLR-1606:


The current NRT IndexWriter.getReader API cannot yet support 
IndexReaderFactory; I'll open a Lucene issue.
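
For context, this is the Lucene 2.9 NRT call being discussed;
getReader comes straight from the IndexWriter, which is why a
custom IndexReaderFactory currently has no hook into it (minimal
sketch):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    // No commit needed: the reader sees the pending in-RAM changes.
    IndexReader reader = writer.getReader();
    System.out.println("numDocs=" + reader.numDocs());  // prints 1
    reader.close();
    writer.close();
  }
}
{code}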

> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1606) Integrate Near Realtime

2009-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787686#action_12787686
 ] 

Jason Rutherglen commented on SOLR-1606:


I was going to start on the auto-warming using IndexWriter's
IndexReaderWarmer; however, because this is heavily cache
dependent, I think it'll have to wait for SOLR-1308, since we
need to regenerate the cache per reader. 
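
For reference, the hook I mean is the merged-segment warmer; what
to actually do inside warm() (regenerating Solr's caches per
reader) is the part that waits on SOLR-1308. A sketch:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;

public class WarmerSetup {
  public static void install(IndexWriter writer) {
    writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
      public void warm(IndexReader reader) throws IOException {
        // Pre-touch the newly merged segment's term index so the
        // first search after publish doesn't pay the cost.
        TermEnum te = reader.terms();
        while (te.next()) {
          // intentionally empty: iterating is the warm-up
        }
        te.close();
      }
    });
  }
}
{code}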

> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1606) Integrate Near Realtime

2009-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787621#action_12787621
 ] 

Jason Rutherglen commented on SOLR-1606:


{quote}For example, q=foo&freshness=1000 would cause a new realtime reader to 
be opened if the current one was more than 1000ms old.{quote}

Good idea.
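
As a strawman, the check could be as simple as this (field and
method names are hypothetical, not from the patch):

{code:java}
public class FreshnessGate {
  private long lastOpenedMs = System.currentTimeMillis();

  /** True when the caller should open a new IndexWriter.getReader(). */
  public synchronized boolean needsReopen(long freshnessMs) {
    long now = System.currentTimeMillis();
    if (now - lastOpenedMs > freshnessMs) {
      lastOpenedMs = now;
      return true;
    }
    return false;  // current realtime reader is fresh enough
  }
}
{code}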

> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1606) Integrate Near Realtime

2009-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787619#action_12787619
 ] 

Jason Rutherglen commented on SOLR-1606:


{quote}In any case, I assume it must not fsync the files, so you
don't get a commit where you know you're in a stable
condition?{quote}

OK, right: for the user, commit currently means that after the
call the index is in a stable state and can be replicated? I
agree; for clarity, I'll create a refresh command and remove the
NRT option from the commit command.



> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1606) Integrate Near Realtime

2009-12-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787221#action_12787221
 ] 

Jason Rutherglen commented on SOLR-1606:


bq. Don't we need a new command, like update_realtime

We could; however, it'd work the same as commit?  Meaning afterwards, all 
pending changes (including deletes) are available?  The commit command is 
fairly overloaded as is.  Are you thinking in terms of replication?

> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1606) Integrate Near Realtime

2009-12-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787206#action_12787206
 ] 

Jason Rutherglen commented on SOLR-1606:


Koji,

Looks like a change to trunk is causing the error; also, when I step through it 
passes, but when I run without stepping it fails...

> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-433) MultiCore and SpellChecker replication

2009-12-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787155#action_12787155
 ] 

Jason Rutherglen commented on SOLR-433:
---

Are the existing patches for multiple cores or only for spellchecking?

> MultiCore and SpellChecker replication
> --
>
> Key: SOLR-433
> URL: https://issues.apache.org/jira/browse/SOLR-433
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (scripts), spellchecker
>Affects Versions: 1.3
>Reporter: Otis Gospodnetic
> Fix For: 1.5
>
> Attachments: RunExecutableListener.patch, SOLR-433-r698590.patch, 
> SOLR-433.patch, SOLR-433.patch, SOLR-433.patch, SOLR-433.patch, 
> solr-433.patch, SOLR-433_unified.patch, spellindexfix.patch
>
>
> With MultiCore functionality coming along, it looks like we'll need to be 
> able to:
>   A) snapshot each core's index directory, and
>   B) replicate any and all cores' complete data directories, not just their 
> index directories.
> Pulled from the "spellchecker and multi-core index replication" thread - 
> http://markmail.org/message/pj2rjzegifd6zm7m
> Otis:
> I think that makes sense - distribute everything for a given core, not just 
> its index.  And the spellchecker could then also have its data dir (and only 
> index/ underneath really) and be replicated in the same fashion.
> Right?
> Ryan:
> Yes, that was my thought.  If an arbitrary directory could be distributed, 
> then you could have
>   /path/to/dist/index/...
>   /path/to/dist/spelling-index/...
>   /path/to/dist/foo
> and that would all get put into a snapshot.  This would also let you put 
> multiple cores within a single distribution:
>   /path/to/dist/core0/index/...
>   /path/to/dist/core0/spelling-index/...
>   /path/to/dist/core0/foo
>   /path/to/dist/core1/index/...
>   /path/to/dist/core1/spelling-index/...
>   /path/to/dist/core1/foo

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1308) Cache docsets at the SegmentReader level

2009-12-04 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786240#action_12786240
 ] 

Jason Rutherglen commented on SOLR-1308:


{quote} Yeah... that's a pain. We could easily do per-segment
faceting for non-string types though (int, long, etc) since they
don't need to be merged. {quote}

I opened SOLR-1617 for this. I think doc sets can be handled
with a multi doc set (hopefully). Facets, however, argh:
FacetComponent is really hairy, though I think it boils down to
simply adding up the counts for the same field values? Then
there seem to be edge cases which I'm scared of. At least it's
easy to test whether we're fulfilling today's functionality by
randomly unit testing per-segment and multi-segment side by side
(i.e. if the results of one are different from the results of
the other, we know there's something to fix).

Perhaps we can initially add up the field value counts, test
that (which is enough for my project), and move on from there.
I'd still like to genericize all of the distributed processes to
work over multiple segments (like Lucene distributed search uses
a MultiSearcher, which also works locally), so that local or
distributed is the same API-wise. However, I've had trouble
figuring out the existing distributed code (SOLR-1477 ran into a
wall). Maybe as part of SolrCloud
http://wiki.apache.org/solr/SolrCloud, we can rework the
distributed APIs to be more user friendly (i.e. *MultiSearcher
is really easy to understand). If Solr's going to work well in
the cloud, distributed search probably needs to be easy to multi
tier for scaling (i.e. if we have 1 proxy server and 100 nodes,
we could have 1 top proxy, and 1 proxy per 10 nodes, etc). 
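
For the simple "add the counts up" case, the per-segment merge
for a single field is just this (a sketch; limits, mincount, and
missing-value handling are exactly the edge cases worried about
above and are not handled here):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class FacetMerge {
  public static Map<String, Integer> merge(Iterable<Map<String, Integer>> perSegment) {
    Map<String, Integer> merged = new HashMap<String, Integer>();
    for (Map<String, Integer> counts : perSegment) {
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        Integer prev = merged.get(e.getKey());
        merged.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
      }
    }
    return merged;
  }
}
{code}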

> Cache docsets at the SegmentReader level
> 
>
> Key: SOLR-1308
> URL: https://issues.apache.org/jira/browse/SOLR-1308
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Solr caches docsets at the top level Multi*Reader level. After a
> commit, the filter/docset caches are flushed. Reloading the
> cache in near realtime (i.e. commits every 1s - 2min)
> unnecessarily consumes IO resources when reloading the filters,
> especially for largish indexes.
> We'll cache docsets at the SegmentReader level. The cache key
> will include the reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1619) Cache documents by their internal ID

2009-12-04 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786233#action_12786233
 ] 

Jason Rutherglen commented on SOLR-1619:


Right, we'd somehow give the user either option.  

> Cache documents by their internal ID
> 
>
> Key: SOLR-1619
> URL: https://issues.apache.org/jira/browse/SOLR-1619
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
>
> Currently documents are cached by their Lucene docid, however we can instead 
> cache them using their schema derived unique id.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1619) Cache documents by their internal ID

2009-12-03 Thread Jason Rutherglen (JIRA)
Cache documents by their internal ID


 Key: SOLR-1619
 URL: https://issues.apache.org/jira/browse/SOLR-1619
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


Currently documents are cached by their Lucene docid, however we can instead 
cache them using their schema derived unique id.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1618) Merge docsets on segment merge

2009-12-03 Thread Jason Rutherglen (JIRA)
Merge docsets on segment merge
--

 Key: SOLR-1618
 URL: https://issues.apache.org/jira/browse/SOLR-1618
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


When SOLR-1308 is implemented, we can save some time when creating new docsets 
by merging them in RAM as segments are merged (similar to LUCENE-1785)
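
A rough sketch of the in-RAM merge (this ignores deletions
compacting docids, which a real merge, as in LUCENE-1785, would
have to account for):

{code:java}
import org.apache.lucene.util.OpenBitSet;

public class DocSetMerge {
  /** Concatenate per-segment docsets by shifting each by its doc base. */
  public static OpenBitSet merge(OpenBitSet[] sources, int[] docBases, long mergedSize) {
    OpenBitSet merged = new OpenBitSet(mergedSize);
    for (int i = 0; i < sources.length; i++) {
      for (long doc = sources[i].nextSetBit(0L); doc >= 0;
           doc = sources[i].nextSetBit(doc + 1)) {
        merged.set(docBases[i] + doc);
      }
    }
    return merged;
  }
}
{code}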

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1617) Cache and merge facets per segment

2009-12-03 Thread Jason Rutherglen (JIRA)
Cache and merge facets per segment
--

 Key: SOLR-1617
 URL: https://issues.apache.org/jira/browse/SOLR-1617
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


Spinoff from SOLR-1308.  We'll enable per-segment facet caching and merging 
which will allow near realtime faceted searching.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1308) Cache docsets at the SegmentReader level

2009-12-03 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1308:
---

Description: 
Solr caches docsets at the top level Multi*Reader level. After a
commit, the filter/docset caches are flushed. Reloading the
cache in near realtime (i.e. commits every 1s - 2min)
unnecessarily consumes IO resources when reloading the filters,
especially for largish indexes.

We'll cache docsets at the SegmentReader level. The cache key
will include the reader.

  was:
Solr caches docsets and documents at the top level Multi*Reader
level. After a commit, the caches are flushed. Reloading the
caches in near realtime (i.e. commits every 1s - 2min)
unnecessarily consumes IO resources, especially for largish
indexes.

We can cache docsets and documents at the SegmentReader level.
The cache settings in SolrConfig can be applied to the
individual SR caches.

Summary: Cache docsets at the SegmentReader level  (was: Cache docsets 
and docs at the SegmentReader level)

I changed the title because we're not going to cache docs in
this issue (though I think it's possible to cache docs by the
internal id, rather than the doc id). 

Per-segment facet caching and merging per segment can go into a
different issue.

> Cache docsets at the SegmentReader level
> 
>
> Key: SOLR-1308
> URL: https://issues.apache.org/jira/browse/SOLR-1308
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Solr caches docsets at the top level Multi*Reader level. After a
> commit, the filter/docset caches are flushed. Reloading the
> cache in near realtime (i.e. commits every 1s - 2min)
> unnecessarily consumes IO resources when reloading the filters,
> especially for largish indexes.
> We'll cache docsets at the SegmentReader level. The cache key
> will include the reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level

2009-12-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785433#action_12785433
 ] 

Jason Rutherglen commented on SOLR-1308:


I realized that because of UnInvertedField, we'll need to merge
facet results from UIF per reader, so using a MultiDocSet won't
help. Can we leverage the distributed merging that FacetComponent
implements (i.e. reuse and/or change the code to work in both the
distributed and local cases)? Ah well, I was hoping for an easy
solution for realtime facets. 

> Cache docsets and docs at the SegmentReader level
> -
>
> Key: SOLR-1308
> URL: https://issues.apache.org/jira/browse/SOLR-1308
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Solr caches docsets and documents at the top level Multi*Reader
> level. After a commit, the caches are flushed. Reloading the
> caches in near realtime (i.e. commits every 1s - 2min)
> unnecessarily consumes IO resources, especially for largish
> indexes.
> We can cache docsets and documents at the SegmentReader level.
> The cache settings in SolrConfig can be applied to the
> individual SR caches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)

2009-12-02 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785014#action_12785014
 ] 

Jason Rutherglen commented on SOLR-1277:


bq. The question then becomes what do you want to make automatic
vs those things that require operator intervention.

Right, I'd like the distributed Solr + ZK system to
automatically fail over to another server if there's a functional
software failure. Also, with a search system query times are
very important, and if they suddenly degrade on a replicated
server, the node needs to be removed and a new server brought
online (hopefully automatically). If Solr + ZK doesn't take out
a server whose query times are 10 times the average of the other
comparable replicated slave servers, then it's harder to
justify going live with it, in my humble opinion, because it's
not really solving the main reason to use a naming service.

While this may not be functionality we need in an initial
release, it's important to ensure our initial design does not
limit future functionality.
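
As a strawman for the "10 times the average" rule (the threshold
and the metrics source here are assumptions, not a spec):

{code:java}
public class LatencyCheck {
  private static final double SLOW_FACTOR = 10.0;

  /** True if this node's recent mean query time is an outlier
   *  relative to the other replicas in its group. */
  public static boolean shouldRemove(double nodeMeanMs, double[] peerMeansMs) {
    double sum = 0;
    for (double m : peerMeansMs) {
      sum += m;
    }
    double peerAvg = sum / peerMeansMs.length;
    return nodeMeanMs > SLOW_FACTOR * peerAvg;
  }
}
{code}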

> Implement a Solr specific naming service (using Zookeeper)
> --
>
> Key: SOLR-1277
> URL: https://issues.apache.org/jira/browse/SOLR-1277
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.5
>
> Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and start from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)

2009-12-02 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784973#action_12784973
 ] 

Jason Rutherglen commented on SOLR-1277:


If we're detecting node failure, it seems Solr's functionality
should also be checked for failure. The discussions thus far
seem to be around network or process failure, which is usually
either intermittent or terminal. Detecting measurable increases
or decreases in CPU and RAM consumption, OOMs, query failures,
and indexing failures due to bugs is probably more important than
detecting the network being down, because they are harder to detect and fix.

How is HBase handling the detection of functional issues in
relation to ZK?

> Implement a Solr specific naming service (using Zookeeper)
> --
>
> Key: SOLR-1277
> URL: https://issues.apache.org/jira/browse/SOLR-1277
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>    Reporter: Jason Rutherglen
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.5
>
> Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and start from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level

2009-12-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784668#action_12784668
 ] 

Jason Rutherglen commented on SOLR-1308:


I'm taking a look at this. It's straightforward to cache and
reuse docsets per reader in SolrIndexSearcher; however, we're
passing docsets all over the place (i.e. UnInvertedField). We
can't exactly rip out DocSet without breaking most unit tests
and writing a bunch of facet merging code. We'd likely lose
functionality? 

Will the MultiDocSet concept (SOLR-568) work as an easy way to
get something that works up and running? Then we can benchmark
and see if we've lost performance?

> Cache docsets and docs at the SegmentReader level
> -
>
> Key: SOLR-1308
> URL: https://issues.apache.org/jira/browse/SOLR-1308
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Solr caches docsets and documents at the top level Multi*Reader
> level. After a commit, the caches are flushed. Reloading the
> caches in near realtime (i.e. commits every 1s - 2min)
> unnecessarily consumes IO resources, especially for largish
> indexes.
> We can cache docsets and documents at the SegmentReader level.
> The cache settings in SolrConfig can be applied to the
> individual SR caches.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1614) Search in Hadoop

2009-11-30 Thread Jason Rutherglen (JIRA)
Search in Hadoop


 Key: SOLR-1614
 URL: https://issues.apache.org/jira/browse/SOLR-1614
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


What's the use case? Sometimes queries are expensive (such as
regex) or one has indexes located in HDFS that then need to be
searched on. By leveraging Hadoop, these non-time-sensitive
queries may be executed without dynamically deploying the
indexes to new Solr servers. 

We'll download the indexes out of HDFS (assuming they're zipped),
perform the queries in a batch on each index shard, then merge
the results either using a Solr query results priority queue, or
simply using Hadoop's built-in merge sorting. 

The query file will be encoded in JSON format (id, query,
numresults, fields). The shards file will simply contain
newline-delimited paths (HDFS or otherwise). The output can be a
Solr-encoded results file per query.

I'm hoping to add an actual Hadoop unit test.
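
For illustration, the encoding could look like this (field names
taken from the description above; the exact format is not settled):

{code}
# queries file: one JSON object per line
{"id": "q1", "query": "body:/se[a]rch/", "numresults": 10, "fields": "id,score"}
{"id": "q2", "query": "title:hadoop", "numresults": 50, "fields": "id,title"}

# shards file: newline-delimited index paths (HDFS or otherwise)
hdfs://namenode:9000/indexes/shard1.zip
hdfs://namenode:9000/indexes/shard2.zip
{code}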

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1610) Add generics to SolrCache

2009-11-29 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1610:
---

Attachment: SOLR-1610.patch

Compiles, ran some of the unit tests.  Not sure what else needs to be done?
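
For reference, a simplified sketch of what the generified
surface might look like (the real SolrCache has more methods and
lifecycle hooks than shown here):

{code:java}
public interface SolrCache<K, V> {
  V put(K key, V value);
  V get(K key);
  int size();
  void clear();
}
{code}

Call sites then lose their casts, e.g. a SolrCache<Query, DocSet>
filterCache returns a DocSet from get(query) directly.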

> Add generics to SolrCache
> -
>
> Key: SOLR-1610
> URL: https://issues.apache.org/jira/browse/SOLR-1610
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Fix For: 1.5
>
> Attachments: SOLR-1610.patch
>
>
> Seems fairly simple for SolrCache to have generics.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1610) Add generics to SolrCache

2009-11-29 Thread Jason Rutherglen (JIRA)
Add generics to SolrCache
-

 Key: SOLR-1610
 URL: https://issues.apache.org/jira/browse/SOLR-1610
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Trivial
 Fix For: 1.5


Seems fairly simple for SolrCache to have generics.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Entity Extraction feature

2009-11-29 Thread Jason Rutherglen
Stanford's is open source and works quite well.
http://nlp.stanford.edu/software/CRF-NER.shtml

On Tue, Nov 17, 2009 at 10:25 PM, Pradeep Pujari
 wrote:
> Hello all,
>
> Does Lucene or Solr has entity extraction feature? If so, what is the wiki 
> URL?
>
> Thanks,
> Pradeep.
>
>


SolrCache not using generics?

2009-11-29 Thread Jason Rutherglen
Maybe we can add generics to SolrCache or is there a design reason not to?


[jira] Created: (SOLR-1609) Create a cache implementation that limits itself to a given RAM size

2009-11-29 Thread Jason Rutherglen (JIRA)
Create a cache implementation that limits itself to a given RAM size


 Key: SOLR-1609
 URL: https://issues.apache.org/jira/browse/SOLR-1609
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


This is a spinoff from the unrelated SOLR-1308. We can limit the
cache sizes by estimated RAM usage. I think in some cases this
is a better approach when compared with using soft references,
as this will effectively limit the cache RAM used. Soft
references will utilize the max heap before divesting themselves
of excessive cached items, which in some cases may not be the
desired behavior.
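
A minimal sketch of the idea, assuming a crude size estimator
(real per-entry RAM accounting is the hard part of this issue):

{code:java}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class RamBoundedCache<K, V> {
  private final long maxRamBytes;
  private long usedRamBytes;
  // Access-order LinkedHashMap gives us LRU iteration order.
  private final LinkedHashMap<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true);

  public RamBoundedCache(long maxRamBytes) {
    this.maxRamBytes = maxRamBytes;
  }

  public synchronized V put(K key, V value) {
    V old = map.put(key, value);
    usedRamBytes += estimateBytes(key, value);
    if (old != null) usedRamBytes -= estimateBytes(key, old);
    // Evict least-recently-used entries until back under budget.
    Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
    while (usedRamBytes > maxRamBytes && it.hasNext()) {
      Map.Entry<K, V> eldest = it.next();
      if (eldest.getKey().equals(key)) break;  // never evict the new entry
      usedRamBytes -= estimateBytes(eldest.getKey(), eldest.getValue());
      it.remove();
    }
    return old;
  }

  public synchronized V get(K key) {
    return map.get(key);
  }

  // Crude stand-in: a real implementation needs per-type estimators.
  private long estimateBytes(K key, V value) {
    return 64 + key.toString().length() * 2L + value.toString().length() * 2L;
  }
}
{code}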

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1606) Integrate Near Realtime

2009-11-28 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1606:
---

Attachment: SOLR-1606.patch

Solr config can have an index nrt (true|false), or commit can specify the nrt 
var.  With nrt=true, when creating a new searcher we call getReader.  
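
As a sketch, the two knobs might look like this (element
placement is my guess at the patch's format, not a released
config):

{code:xml}
<!-- solrconfig.xml: make NRT the default for new searchers -->
<mainIndex>
  <nrt>true</nrt>
</mainIndex>

<!-- or, per update request: -->
<commit nrt="true"/>
{code}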

> Integrate Near Realtime 
> 
>
> Key: SOLR-1606
> URL: https://issues.apache.org/jira/browse/SOLR-1606
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1606.patch
>
>
> We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1606) Integrate Near Realtime

2009-11-28 Thread Jason Rutherglen (JIRA)
Integrate Near Realtime 


 Key: SOLR-1606
 URL: https://issues.apache.org/jira/browse/SOLR-1606
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


We'll integrate IndexWriter.getReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1578) Develop a Spatial Query Parser

2009-11-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780184#action_12780184
 ] 

Jason Rutherglen commented on SOLR-1578:


GBase http://code.google.com/apis/base/docs/2.0/query-lang-spec.html (Locations 
section at the bottom of the page) has a query syntax for spatial queries (i.e. 
@+40.75-074.00 + 5mi)

> Develop a Spatial Query Parser
> --
>
> Key: SOLR-1578
> URL: https://issues.apache.org/jira/browse/SOLR-1578
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
> Fix For: 1.5
>
>
> Given all the work around spatial, it would be beneficial if Solr had a query 
> parser for dealing with spatial queries.  For starters, something that used 
> geonames data or maybe even Google Maps API would be really useful.  Longer 
> term, a spatial grammar that can robustly handle all the vagaries of 
> addresses, etc. would be really cool.
> Refs: 
> [1] http://www.geonames.org/export/client-libraries.html (note the Java 
> client is ASL)
> [2] Data from geo names: http://download.geonames.org/export/dump/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1506) Search multiple cores using MultiReader

2009-11-09 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1506:
---

Attachment: SOLR-1506.patch

MultiReader doesn't support reopen with the readOnly parameter.  This patch 
adds a test case for commit on the proxy, and a workaround (if an 
UnsupportedOperationException is caught, then the regular reopen is called).
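
The workaround boils down to this pattern (a sketch against the
Lucene 2.9 API, not the patch verbatim):

{code:java}
import org.apache.lucene.index.IndexReader;

public class ReopenFallback {
  public static IndexReader reopen(IndexReader reader) throws Exception {
    try {
      return reader.reopen(true);  // read-only reopen
    } catch (UnsupportedOperationException e) {
      return reader.reopen();      // plain reopen fallback
    }
  }
}
{code}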

> Search multiple cores using MultiReader
> ---
>
> Key: SOLR-1506
> URL: https://issues.apache.org/jira/browse/SOLR-1506
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>    Reporter: Jason Rutherglen
>Priority: Trivial
> Fix For: 1.5
>
> Attachments: SOLR-1506.patch, SOLR-1506.patch, SOLR-1506.patch
>
>
> I need to search over multiple cores, and SOLR-1477 is more
> complicated than expected, so here we'll create a MultiReader
> over the cores to allow searching on them.
> Maybe in the future we can add parallel searching however
> SOLR-1477, if it gets completed, provides that out of the box.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


