Re: Solr & Java 1.6 ... was: Re: [jira] Commented: (SOLR-1873) Commit Solr Cloud to trunk
I'm planning on using Solr Cloud, and have kind of been waiting for the commit to trunk, so let's do it (i.e., Java 6). On Wed, Apr 14, 2010 at 11:32 PM, Ryan McKinley wrote: > I'm fine with 1.6 as a min requirement... but I imagine others have > different opinions :) > > > On Wed, Apr 14, 2010 at 2:53 PM, Yonik Seeley > wrote: >> Yes, it requires that Solr in general is compiled with Java6. We >> should make our lives easier and make Java6 a Solr requirement. >> Zookeeper requires Java6, and we also want Java6 for some of the >> scripting capabilities. >> >> -Yonik >> Apache Lucene Eurocon 2010 >> 18-21 May 2010 | Prague >> >> >> On Wed, Apr 14, 2010 at 2:35 PM, Chris Hostetter >> wrote: >>> >>> I haven't been following the Cloud stuff very closely; can someone clarify >>> what exactly the situation is w/ Solr Cloud and Java 1.6. >>> >>> Will merging the cloud changes to trunk require that core pieces of Solr >>> be compiled/run with Java 1.6 (i.e., a change to our minimum operating >>> requirements), or will it just require that people wanting cloud >>> management features use a 1.6 JVM and include a new Solr contrib and >>> appropriate config options at run time (and this contrib is the only thing >>> that needs to be compiled with 1.6)? >>> >>> As far as Hudson and the build system go ... there's certainly no reason >>> we can't have more than one setup ... one build using JDK 1.5 (with the >>> build.xml files detecting the JDK version and vocally not building the >>> code that can't be compiled, either just the contrib or all of Solr), and >>> a separate build using JDK 1.6 that builds and tests everything. >>> >>> (Having this setup in general would be handy if/when other Lucene contribs >>> start wanting to incorporate Java 1.6 features.) >>> >>> >>> : bq. As I wrap up the remaining work here, one issue looms: We are going >>> : to need to move Hudson to Java 6 before this can be committed. >>> : >>> : In most respects, I think that would be a positive anyway. Java6 is now >>> : the primary production deployment platform for new projects (and it's >>> : new projects that will be using new Lucene and/or Solr). With respect >>> : to keeping Lucene Java5-compatible, we can always run the tests with >>> : Java5 before commits (that's what I did in the past when Lucene was on >>> : Java 1.4). >>> >>> >>> >>> -Hoss
[jira] Commented: (SOLR-1375) BloomFilter on a field
[ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851637#action_12851637 ] Jason Rutherglen commented on SOLR-1375: {quote}Doesn't this hint at some of this stuff (haven't looked at the patch) really needing to live in Lucene index segment files merging land?{quote} Adding this to Lucene is out of the scope of what I require, and I don't have time to work on that unless it's going to be committed. > BloomFilter on a field > -- > > Key: SOLR-1375 > URL: https://issues.apache.org/jira/browse/SOLR-1375 > Project: Solr > Issue Type: New Feature > Components: update > Affects Versions: 1.4 > Reporter: Jason Rutherglen > Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, > SOLR-1375.patch, SOLR-1375.patch > > Original Estimate: 120h > Remaining Estimate: 120h > > * A Bloom filter is a read-only probabilistic set. It's useful > for verifying that a key exists in a set, though it can return false > positives. http://en.wikipedia.org/wiki/Bloom_filter > * The use case is indexing in Hadoop and checking for duplicates > against a Solr cluster (which, when using the term dictionary or a > query, is too slow and exceeds the time consumed for indexing). > When a match is found, the host, segment, and term are returned. > If the same term is found on multiple servers, multiple results > are returned by the distributed process. (We'll need to add in > the core name, I just realized.) > * When new segments are created and commit is called, a new > Bloom filter is generated from a given field (default: id) by > iterating over the term dictionary values. There's a Bloom > filter file per segment, which is managed on each Solr shard. > When segments are merged away, their corresponding .blm files are > also removed. In a future version we'll have a central server > for the bloom filters so we're not abusing the thread pool of > the Solr proxy and the networking of the Solr cluster (this will > be done sooner rather than later, after testing this version). I held > off because the central server requires syncing the Solr > servers' files (which is like reverse replication). > * The patch uses the BloomFilter from Hadoop 0.20. I want to jar > up only the necessary classes so we don't have a giant Hadoop > jar in lib. > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html > * Distributed code is added and seems to work; I extended > TestDistributedSearch to test over multiple HTTP servers. I > chose this approach rather than the manual method used by (for > example) TermVectorComponent.testDistributed because I'm new to > Solr's distributed search and wanted to learn how it works (the > stages are confusing). Using this method, I didn't need to set up > multiple Tomcat servers and manually execute tests. > * We need more of the Bloom filter options passable via > solrconfig > * I'll add more test cases
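The patch itself isn't reproduced in this thread; as a rough illustration of the Hadoop 0.20 BloomFilter API that the description links to, with sizing numbers and ids that are assumptions rather than values from the patch:

```java
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomSketch {
    public static void main(String[] args) {
        // Bit-vector size and hash count are illustrative, not tuned.
        BloomFilter filter = new BloomFilter(1 << 20, 4, Hash.MURMUR_HASH);

        // The patch builds the per-segment filter by iterating the term
        // dictionary of the id field at commit time; here we add literals.
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            filter.add(new Key(id.getBytes()));
        }

        // false = definitely absent; true = possibly present
        // (false positives are expected, as the description notes).
        System.out.println(filter.membershipTest(new Key("doc-2".getBytes())));
        System.out.println(filter.membershipTest(new Key("doc-9".getBytes())));
    }
}
```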
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Fixed the unit tests that were failing due to the switch over to using CoreContainer's initZooKeeper method. ZkNodeCoresManager is instantiated in CoreContainer. There's a beginning of a UI in zkcores.jsp. I think we still need a core move test. I'm thinking of adding backing up a core as an action that may be performed in a new cores version file. > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore > Affects Versions: 1.4 > Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon that I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers, which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc. > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Every time a new core-N.txt file is added, the listening host > finds its entry in the list and begins the process of trying to > match the entries. Upon completion, it updates its > /core-host-1-actual.txt file to its completed state, or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written, which can be picked up by > another process that can create a new core proxy.
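A minimal sketch, under stated assumptions, of how a listening host might parse one of the cores-N.txt files described above. Only the hostname,corename,instanceDir,coredownloadpath field order comes from the description; the class and method names here are hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CoresFileParser {
    /** One line of a cores-N.txt file: hostname,corename,instanceDir,coredownloadpath */
    static class CoreEntry {
        final String hostname, coreName, instanceDir, downloadPath;
        CoreEntry(String[] f) {
            hostname = f[0]; coreName = f[1]; instanceDir = f[2]; downloadPath = f[3];
        }
    }

    /** Parse the file contents and keep only the entries for this host. */
    static List<CoreEntry> entriesFor(String fileContents, String thisHost) throws IOException {
        List<CoreEntry> mine = new ArrayList<CoreEntry>();
        BufferedReader reader = new BufferedReader(new StringReader(fileContents));
        for (String line; (line = reader.readLine()) != null; ) {
            if (line.trim().length() == 0) continue;
            String[] fields = line.split(",");
            if (fields[0].equals(thisHost)) {
                mine.add(new CoreEntry(fields));
            }
        }
        return mine;
    }
}
```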
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839937#action_12839937 ] Jason Rutherglen commented on SOLR-1724: I'm starting work on the cores file upload. The cores file is in JSON format, and can be assembled by an entirely different process (i.e., the core assignment creation is decoupled from core deployment). I need to figure out how HTML/HTTP file uploading works in Solr... There's probably an example somewhere.
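The JSON layout of the cores file isn't shown in the thread, but the patch ships gson-1.4.jar, so a hedged sketch of reading such a file with Gson might look like the following; every field name below is hypothetical:

```java
import com.google.gson.Gson;
import java.util.List;

public class CoresJsonSketch {
    // Hypothetical shape for one core entry; the real field names in the
    // patch's JSON format are not shown in this thread.
    static class CoreEntry {
        String hostname, corename, instanceDir, coredownloadpath;
    }
    static class CoresFile {
        int version;
        List<CoreEntry> cores;
    }

    public static void main(String[] args) {
        String json = "{\"version\": 2, \"cores\": ["
            + "{\"hostname\": \"host1\", \"corename\": \"core0\","
            + " \"instanceDir\": \"/solr/core0\","
            + " \"coredownloadpath\": \"hdfs://nn/cores/core0.zip\"}]}";
        CoresFile cores = new Gson().fromJson(json, CoresFile.class);
        System.out.println(cores.cores.get(0).corename);
    }
}
```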
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839520#action_12839520 ] Jason Rutherglen commented on SOLR-1724: Started on the nodes reporting their status to separate files that are ephemeral nodes. There's no sense in keeping them around if the node isn't up, and the status is legitimately ephemeral. In this case, the status will be something like "Core download 45% (7 GB of 15 GB)".
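For reference, a minimal sketch of publishing such a status string as an ephemeral ZooKeeper node. The ZooKeeper calls are the standard client API; the znode path layout and class name are assumptions:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class StatusReporter {
    private final ZooKeeper zk;
    private final String path; // e.g. /production/status/host1 (assumed layout)

    StatusReporter(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    /** Create the ephemeral node once; it vanishes when the session dies. */
    void start(String status) throws KeeperException, InterruptedException {
        zk.create(path, status.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }

    /** Overwrite the status, e.g. "Core download 45% (7 GB of 15 GB)". */
    void update(String status) throws KeeperException, InterruptedException {
        zk.setData(path, status.getBytes(), -1); // -1 = any version
    }
}
```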
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838926#action_12838926 ] Jason Rutherglen commented on SOLR-1724: In thinking about this some more, in order for the functionality provided in this issue to be more useful, there could be a web-based UI to easily view the master cores table. There could additionally be an easy way to upload a new cores version into Zookeeper. I'm not sure if the uploading should be web based or command line; I'm figuring web based, simply because this is more in line with the rest of Solr. As a core is installed or is in the midst of some other process (such as backing itself up), the node/NodeCoresManager can report the ongoing status to Zookeeper. For large cores (e.g., 20 GB) it's important to see how they're doing, and if they're taking too long, begin some remedial action. The UI can display the statuses.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Backing a core up works, at least according to the test case... I will probably begin to test this patch in a staging environment next, where Zookeeper is run in its own process and a real HDFS cluster is used.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Zipping from a Lucene directory works and has a test case. A ReplicationHandler is added by default under a unique name; if one exists already, we still create our own, for the express purpose of locking an index commit point, zipping it, then uploading it to, for example, HDFS. This part will likely be written next.
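The patch's zipping code isn't shown in the thread. As a hedged sketch of the zipping half only, streaming every file of a Lucene Directory into a zip archive using the 2.9-era Directory API; pinning the index commit point first (which the comment says the dedicated ReplicationHandler is for) is omitted here:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;

public class DirectoryZipper {
    /** Stream every file in the Directory into a zip archive. */
    public static void zip(Directory dir, OutputStream out) throws IOException {
        ZipOutputStream zos = new ZipOutputStream(out);
        byte[] buf = new byte[64 * 1024];
        for (String name : dir.listAll()) {
            zos.putNextEntry(new ZipEntry(name));
            IndexInput in = dir.openInput(name);
            try {
                long remaining = in.length();
                while (remaining > 0) {
                    int chunk = (int) Math.min(buf.length, remaining);
                    in.readBytes(buf, 0, chunk);
                    zos.write(buf, 0, chunk);
                    remaining -= chunk;
                }
            } finally {
                in.close();
            }
            zos.closeEntry();
        }
        zos.finish();
    }
}
```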
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837898#action_12837898 ] Jason Rutherglen commented on SOLR-1724: I'm not sure how we'll handle (or if we even need to) installing a new core over an existing core of the same name, in other words, core replacement. I think the instanceDir would need to be different, which means we'll need to detect and fail on the case of a new cores version (aka desired state) trying to install itself into an existing core's instanceDir. Otherwise this potential error case is costly in production. It makes me wonder about the shard id in Solr Cloud, and how that can be used to uniquely identify an installed core if a core of a given name is not guaranteed to be the same across Solr servers.
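A tiny sketch of the detect-and-fail check described above, assuming cores are tracked in a map keyed by instanceDir; the map shape and exception type are illustrative assumptions, not the patch's actual API:

```java
import java.util.Map;

public class InstallValidator {
    /**
     * Fail fast when a desired-state entry targets an instanceDir that an
     * already-installed core is using, rather than overwriting it.
     */
    static void checkInstanceDir(Map<String, String> coreNameByInstanceDir,
                                 String newCoreName, String newInstanceDir) {
        String owner = coreNameByInstanceDir.get(newInstanceDir);
        if (owner != null) {
            throw new IllegalStateException("instanceDir " + newInstanceDir
                + " already belongs to installed core " + owner
                + "; refusing to install core " + newCoreName + " over it");
        }
    }
}
```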
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch I added a test case that simulates attempting to install a bad core. Still need to get backing up a Solr core to HDFS working.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837418#action_12837418 ] Jason Rutherglen commented on SOLR-1724: We need a test case with a partial install, and for cleaning up any extraneous files afterwards.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836898#action_12836898 ] Jason Rutherglen commented on SOLR-1724: Actually, I just realized the whole exercise of moving a core is pointless; it's exactly the same as replication, so this is a non-issue... I'm going to work on backing up a core to HDFS...
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836896#action_12836896 ] Jason Rutherglen commented on SOLR-1724: I'm taking the approach of simply reusing SnapPuller and a replication handler for each core... This'll be faster to implement and more reliable for the first release (i.e., I won't run into little wacky bugs, because I'll be reusing code that's well tested).
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836022#action_12836022 ] Jason Rutherglen commented on SOLR-1724: We need a URL type parameter to define whether a URL in a core info entry points to a zip file or to a Solr server download point.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836018#action_12836018 ] Jason Rutherglen commented on SOLR-1724: Some further notes... I can reuse the replication code, but am going to place the functionality into the core admin handler, because it needs to work across cores and not have to be configured in each core's solrconfig. Also, we need to somehow support merging cores... Is that available yet? It looks like merge indexes only works on directories?
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836013#action_12836013 ] Jason Rutherglen commented on SOLR-1724: I think the check on whether a conf file has been modified, to decide whether to reload the core, can borrow from the replication handler and diff based on checksums of the files... Though this somewhat complicates the storage of the checksum and the resultant JSON file.
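A minimal sketch of such a modification check, assuming a plain Adler32 checksum per conf file; the replication handler's exact checksum scheme isn't shown in this thread:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;

public class ConfChecksum {
    /** Checksum a conf file; a changed value signals the core should reload. */
    static long checksum(File f) throws IOException {
        CheckedInputStream in =
            new CheckedInputStream(new FileInputStream(f), new Adler32());
        try {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // reading drives the checksum; nothing else to do
            }
            return in.getChecksum().getValue();
        } finally {
            in.close();
        }
    }
}
```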
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835981#action_12835981 ] Jason Rutherglen commented on SOLR-1724: {quote}Will this http access also allow a cluster with incrementally updated cores to replicate a core after a node failure? {quote} You're talking about moving an existing core into HDFS? That's a great idea... I'll add it to the list! Maybe for general "actions" on the system, there can be a ZK directory acting as a queue that contains actions to be performed by the cluster. When an action is completed, its corresponding action file is deleted.
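A rough sketch of the ZK action-queue idea, using sequential znodes for ordering; the paths are assumptions, and the delete-after-read simplification differs from the comment's delete-after-completion semantics:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ActionQueue {
    private final ZooKeeper zk;
    private final String dir; // e.g. /production/actions (assumed path)

    ActionQueue(ZooKeeper zk, String dir) { this.zk = zk; this.dir = dir; }

    /** Enqueue: sequential znodes give a stable FIFO ordering. */
    void submit(byte[] action) throws KeeperException, InterruptedException {
        zk.create(dir + "/action-", action, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT_SEQUENTIAL);
    }

    /** Take the oldest action and remove it from the queue. */
    byte[] next() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(dir, false);
        if (children.isEmpty()) return null;
        Collections.sort(children); // sequence suffix sorts oldest-first
        String oldest = dir + "/" + children.get(0);
        byte[] data = zk.getData(oldest, false, null);
        // Simplification: a real implementation would delete only after the
        // action completes, as the comment describes.
        zk.delete(oldest, -1);
        return data;
    }
}
```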
Can we move FileFetcher out of SnapPuller?
Can we move FileFetcher out of SnapPuller? This will assist with reusing the replication handler for moving/copying cores.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835965#action_12835965 ] Jason Rutherglen commented on SOLR-1724: For the above core moving, utilizing the existing Java replication will probably be suitable. However, in all cases we need to copy the contents of all files related to the core (meaning everything under conf and data). How does one accomplish this?
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835955#action_12835955 ] Jason Rutherglen commented on SOLR-1724: Also needed is the ability to move an existing core to a different Solr server. The core will need to be copied via direct HTTP file access, from one Solr server to another. There is no need to zip the core first. This feature is useful for core indexes that have been incrementally built and then need to be archived (i.e., the index was not constructed using Hadoop).
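A hedged sketch of the HTTP copy step, pulling a single core file from another server with plain java.net; the URL under which core files would be exposed is an assumption (Solr 1.4's replication handler serves index files through its filecontent command, which could be reused instead):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class HttpFileCopier {
    /** Pull one core file from another Solr server over plain HTTP. */
    static void fetch(URL fileUrl, File dest) throws IOException {
        InputStream in = fileUrl.openStream();
        OutputStream out = new FileOutputStream(dest);
        try {
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```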
[jira] Issue Comment Edited: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835871#action_12835871 ] Jason Rutherglen edited comment on SOLR-1724 at 2/19/10 6:36 PM: - Removing cores seems to work well; on to modified cores... I'm checkpointing progress so that in case things break, I can easily roll back. was (Author: jasonrutherglen): Removing cores seems to work well, on to modified cores... I checkpointing progress in case things break, I can easily roll back.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Removing cores seems to work well; on to modified cores... I'm checkpointing progress so that in case things break, I can easily roll back.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835819#action_12835819 ] Jason Rutherglen commented on SOLR-1724: We need a test case for deleted and modified cores.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Added a way to hold a given number of host or cores files around in ZK; beyond that number, the oldest are deleted.
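A minimal sketch of such a retention policy against ZooKeeper; it assumes child names sort oldest-first, whereas a real implementation might sort on znode creation order instead:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ZkRetention {
    /**
     * Keep only the newest maxFiles children under dir, deleting the rest.
     * Lexicographic sort is a simplification: names like cores-1.txt and
     * cores-10.txt would need numeric or creation-time ordering.
     */
    static void prune(ZooKeeper zk, String dir, int maxFiles)
            throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(dir, false);
        Collections.sort(children);
        int excess = children.size() - maxFiles;
        for (int i = 0; i < excess; i++) {
            zk.delete(dir + "/" + children.get(i), -1); // -1 = any version
        }
    }
}
```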
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835513#action_12835513 ] Jason Rutherglen commented on SOLR-1724: I need to add the deletion policy before I can test this in a real environment; otherwise, bunches of useless files will pile up in ZK.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Updated to HEAD > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835490#action_12835490 ] Jason Rutherglen commented on SOLR-1724: I need to figure out how to integrate this with the Solr Cloud distributed search stuff... Hmm... Maybe I'll start with the Solr Cloud test cases? > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch * No-commit * NodeCoresManagerTest.testInstallCores works * There are HDFS test cases using MiniDFSCluster (see the sketch after this message) > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 > Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
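For readers unfamiliar with MiniDFSCluster (it ships in the Hadoop test jar attached to this issue), a test along those lines would be shaped roughly like this; the test below is an illustrative stand-in, not the one in the patch:
{code}
import junit.framework.TestCase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class HdfsCoreDownloadTest extends TestCase {
  private MiniDFSCluster cluster;
  private FileSystem fs;

  @Override
  protected void setUp() throws Exception {
    // spin up a single-datanode HDFS cluster inside the test JVM
    cluster = new MiniDFSCluster(new Configuration(), 1, true, null);
    fs = cluster.getFileSystem();
  }

  public void testCoreZipIsVisible() throws Exception {
    Path core = new Path("/cores/core0.zip");
    fs.create(core).close();
    assertTrue(fs.exists(core));
  }

  @Override
  protected void tearDown() throws Exception {
    if (cluster != null) cluster.shutdown();
  }
}
{code}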
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch No-commit NodeCoresManager[Test] needs more work A CoreController matchHosts unit test was added to CoreControllerTest > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
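The matchHosts signature isn't shown in this thread; given the core-N.txt line format from the description (hostname,corename,instanceDir,coredownloadpath), a plausible shape for what such a test exercises is sketched below. The class and method names here are guesses, not the patch's actual API:
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CoreController.matchHosts: filter parsed
// cores-file entries down to the ones addressed to one host, which is
// roughly what the description says each listening host must do.
public class MatchHostsSketch {
  public static List<String[]> matchHosts(List<String[]> entries, String hostname) {
    List<String[]> matched = new ArrayList<String[]>();
    for (String[] entry : entries) {
      if (hostname.equals(entry[0])) { // field 0 is the hostname
        matched.add(entry);
      }
    }
    return matched;
  }
}
{code}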
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834539#action_12834539 ] Jason Rutherglen commented on SOLR-1724: There's a wiki for this issue where the general specification is defined: http://wiki.apache.org/solr/DeploymentofSolrCoreswithZookeeper > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833108#action_12833108 ] Jason Rutherglen commented on SOLR-1301: There still seems to be a bug where the temporary directory index isn't deleted on job completion. > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832587#action_12832587 ] Jason Rutherglen commented on SOLR-1395: shyjuThomas, It'd be good to update this patch to the latest Katta... You're welcome to do so... For my project I only need what'll be in SOLR-1724... > Integrate Katta > --- > > Key: SOLR-1395 > URL: https://issues.apache.org/jira/browse/SOLR-1395 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, > katta.node.properties, katta.zk.properties, log4j-1.2.13.jar, > solr-1395-1431-3.patch, solr-1395-1431-4.patch, solr-1395-1431.patch, > SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, > test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, zookeeper-3.2.1.jar > > Original Estimate: 336h > Remaining Estimate: 336h > > We'll integrate Katta into Solr so that: > * Distributed search uses Hadoop RPC > * Shard/SolrCore distribution and management > * Zookeeper based failover > * Indexes may be built using Hadoop -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1761) Command line Solr check softwares
[ https://issues.apache.org/jira/browse/SOLR-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1761: --- Attachment: SOLR-1761.patch Here's a cleaned-up, committable version > Command line Solr check softwares > - > > Key: SOLR-1761 > URL: https://issues.apache.org/jira/browse/SOLR-1761 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 > Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: SOLR-1761.patch, SOLR-1761.patch > > > I'm in need of a command tool Nagios and the like can execute that verifies a > Solr server is working... Basically it'll be a jar with apps that return > error codes if a given criteria isn't met. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1761) Command line Solr check softwares
[ https://issues.apache.org/jira/browse/SOLR-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1761: --- Attachment: SOLR-1761.patch No-commit Here are a couple of apps that: 1) Check the query time 2) Check the last replication time They exit with error code 1 on failure, 0 on success > Command line Solr check softwares > - > > Key: SOLR-1761 > URL: https://issues.apache.org/jira/browse/SOLR-1761 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 > Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: SOLR-1761.patch > > > I'm in need of a command tool Nagios and the like can execute that verifies a > Solr server is working... Basically it'll be a jar with apps that return > error codes if a given criteria isn't met. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
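To give a sense of what the query-time check might look like with SolrJ: the class name, arguments, and threshold handling below are illustrative, not the patch's code. Nagios simply runs the jar and maps exit code 0 to OK and 1 to CRITICAL.
{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CheckQueryTime {
  // args[0] = Solr URL (e.g. http://localhost:8983/solr), args[1] = threshold in ms
  public static void main(String[] args) {
    try {
      SolrServer server = new CommonsHttpSolrServer(args[0]);
      int thresholdMs = Integer.parseInt(args[1]);
      long start = System.currentTimeMillis();
      QueryResponse rsp = server.query(new SolrQuery("*:*"));
      long elapsed = System.currentTimeMillis() - start;
      if (rsp.getStatus() != 0 || elapsed > thresholdMs) {
        System.err.println("CRITICAL: status=" + rsp.getStatus() + " time=" + elapsed + "ms");
        System.exit(1); // failure exit code, per the comment above
      }
      System.out.println("OK: " + elapsed + "ms");
      System.exit(0);
    } catch (Exception e) {
      // an unreachable server is also a failed check
      System.err.println("CRITICAL: " + e.getMessage());
      System.exit(1);
    }
  }
}
{code}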
Re: Real-time deletes
Hello there dude... I started on this: http://issues.apache.org/jira/browse/SOLR-1606 However, since then things have changed, so it may not work... You're welcome to continue on it... Cheers, Jason On Tue, Feb 9, 2010 at 3:20 PM, Kaktu Chakarabati wrote: > Hey Guys, > havent heard back from anyone - Would really appreciate any response what so > ever (even a 'extremely not feasible right now'), just so i know > if to try and pursue this direction or abandon.. > > Thanks, > -Chak > > On Fri, Feb 5, 2010 at 11:41 AM, KaktuChakarabati wrote: > >> >> Hey, >> some time ago I asked around and found out that lucene has inbuilt support >> pretty much for propagating deletes to the active index without a lengthy >> commit ( I do not remember the exact semantics but I believe it involves >> using an IndexReader reopen() method or so). >> I wanted to check back and find out whether solr now makes use of this in >> any way - Otherwise, is anyone working on such a feature - And Otherwise, >> if >> i'd like to pick up the glove on this, what would be a correct way, >> architecture-wise to go about it ? implement as a separate UpdateHandler / >> flag..? >> >> Thanks, >> -Chak >> -- >> View this message in context: >> http://old.nabble.com/Real-time-deletes-tp27472975p27472975.html >> Sent from the Solr - Dev mailing list archive at Nabble.com. >> >> >
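For reference, the reopen() pattern Chak describes looks roughly like this against the Lucene API of that era; this is a sketch of the Lucene idiom, not Solr's actual update path:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ReopenAfterDelete {
  /**
   * Deletes by term and returns a reader that sees the deletion,
   * reusing unchanged segments instead of reloading the whole index.
   */
  public static IndexReader deleteAndRefresh(IndexWriter writer,
      IndexReader reader, Term term) throws IOException {
    writer.deleteDocuments(term);
    writer.commit();                  // make the delete visible to reopen()
    IndexReader fresh = reader.reopen();
    if (fresh != reader) {
      reader.close();                 // reopen() shared the untouched segments
    }
    return fresh;
  }
}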
[jira] Created: (SOLR-1761) Command line Solr check softwares
Command line Solr check softwares - Key: SOLR-1761 URL: https://issues.apache.org/jira/browse/SOLR-1761 Project: Solr Issue Type: New Feature Affects Versions: 1.4 Reporter: Jason Rutherglen Fix For: 1.5 I'm in need of a command tool Nagios and the like can execute that verifies a Solr server is working... Basically it'll be a jar with apps that return error codes if a given criteria isn't met. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829200#action_12829200 ] Jason Rutherglen commented on SOLR-1301: In production the latest patch does not leave temporary files behind... Though earlier runs had failed tasks, so perhaps there's still a bug; we won't know until we run out of disk space again. > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1301: --- Attachment: SOLR-1301.patch I added the following to the SRW.close method's finally clause: {code} FileUtils.forceDelete(new File(temp.toString())); {code} > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
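For context, here's roughly where that call sits; the surrounding method is a paraphrase assuming temp is the writer's local scratch directory and FileUtils is commons-io, since the real SolrRecordWriter.close() isn't reproduced in this thread:
{code}
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.fs.Path;

// Paraphrase only — the real close() commits/optimizes the embedded
// server and packs the shard before this cleanup runs.
public class ScratchCleanup {
  public static void closeAndClean(Path temp) throws IOException {
    try {
      // ... commit, optimize, and pack the shard ...
    } finally {
      // runs even when packing fails, so failed tasks can't leak the dir
      FileUtils.forceDelete(new File(temp.toString()));
    }
  }
}
{code}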
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828368#action_12828368 ] Jason Rutherglen commented on SOLR-1301: I'm testing deleting the temp dir in SRW.close finally... > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828172#action_12828172 ] Jason Rutherglen commented on SOLR-1301: There's a bug caused by the latest change: {quote} java.io.IOException: java.lang.IllegalArgumentException: Wrong FS: hdfs://mi-prod-app01.ec2.biz360.com:9000/user/hadoop/solr/_attempt_201001212110_2841_r_01_0.1.index-a, expected: file:/// at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:371) at com.biz360.mi.index.hadoop.HadoopIndexer$ArticleReducer.reduce(HadoopIndexer.java:147) at com.biz360.mi.index.hadoop.HadoopIndexer$ArticleReducer.reduce(HadoopIndexer.java:103) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://mi-prod-app01.ec2.biz360.com:9000/user/hadoop/solr/_attempt_201001212110_2841_r_01_0.1.index-a, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:305) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.solr.hadoop.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:459) at org.apache.solr.hadoop.SolrRecordWriter.packZipFile(SolrRecordWriter.java:390) at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:362) ... 5 more {quote} > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. 
> This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
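That trace is the classic symptom of resolving an hdfs:// Path against the local filesystem. A minimal illustration of the mismatch and the usual remedy (the hostname and path below are made up, and this is not necessarily the fix the patch applied):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path shard = new Path("hdfs://namenode:9000/user/hadoop/solr/shard.1.index");
    // Correct: let the Path's scheme pick the FileSystem (here, HDFS)
    FileSystem fs = shard.getFileSystem(conf);
    System.out.println(fs.getUri());
    // Incorrect: the local filesystem rejects an hdfs:// path in checkPath(),
    // producing exactly the "Wrong FS ... expected: file:///" error above
    FileSystem.getLocal(conf).getFileStatus(shard);
  }
}
{code}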
[jira] Updated: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1301: --- Attachment: SOLR-1301.patch This update includes Kevin's recommended path change > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Here's an update; we're onto the actual Solr node portion of the code, and some tests around that. I'm focusing on downloading cores out of HDFS because that's my use case. > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore > Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, > SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
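The HDFS download itself reduces to resolving a FileSystem from the coredownloadpath URL and copying to the instance directory. A minimal sketch under those assumptions (the class name and the "data" destination are illustrative):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoreDownloader {
  /** Pulls a packed core out of HDFS (or any supported scheme) onto local disk. */
  public static void download(String coredownloadpath, String instanceDir)
      throws Exception {
    Path src = new Path(coredownloadpath); // e.g. hdfs://host:9000/cores/core0.zip
    Configuration conf = new Configuration();
    // resolve the filesystem from the URL scheme (hdfs://, file://, ...)
    FileSystem fs = src.getFileSystem(conf);
    fs.copyToLocalFile(src, new Path(instanceDir, "data"));
  }
}
{code}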
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: gson-1.4.jar hadoop-0.20.2-dev-test.jar hadoop-0.20.2-dev-core.jar Hadoop and Gson dependencies > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, gson-1.4.jar, > hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804773#action_12804773 ] Jason Rutherglen commented on SOLR-1724: For some reason ZkTestServer doesn't need to be shut down any longer? > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804760#action_12804760 ] Jason Rutherglen commented on SOLR-1724: The ZK port changed in ZkTestServer > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804750#action_12804750 ] Jason Rutherglen commented on SOLR-1724: I did an svn update, though now I'm seeing the following error: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper within 5000 ms at org.apache.solr.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:131) at org.apache.solr.cloud.SolrZkClient.<init>(SolrZkClient.java:106) at org.apache.solr.cloud.SolrZkClient.<init>(SolrZkClient.java:72) at org.apache.solr.cloud.CoreControllerTest.testCores(CoreControllerTest.java:48) > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804655#action_12804655 ] Jason Rutherglen commented on SOLR-1724: Need a command-line tool that dumps the state of the existing cluster from ZK out to a JSON file for a particular version. For my setup I'll have a program that looks at this cluster state file and generates an input file to be written to ZK, which essentially instructs the Solr nodes to match the new cluster state. This lets me write my own functionality that operates on the cluster without deploying new software into Solr. > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore > Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
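A dump tool along those lines only needs ZooKeeper reads plus Gson (already attached to this issue). Everything below — the class name, arguments, and the flat map output — is an illustrative guess at the shape, not the eventual tool:
{code}
import java.io.FileWriter;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import com.google.gson.Gson;

public class DumpClusterState {
  // args[0] = ZK host:port, args[1] = ZK dir (e.g. /production), args[2] = output .json
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper(args[0], 5000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op */ }
    });
    Map<String, String> state = new HashMap<String, String>();
    List<String> children = zk.getChildren(args[1], false);
    for (String child : children) {
      byte[] data = zk.getData(args[1] + "/" + child, false, null);
      state.put(child, data == null ? null : new String(data, "UTF-8"));
    }
    FileWriter out = new FileWriter(args[2]);
    out.write(new Gson().toJson(state)); // one JSON object keyed by node name
    out.close();
    zk.close();
  }
}
{code}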
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804590#action_12804590 ] Jason Rutherglen commented on SOLR-1724: {quote}If you know your going to not store file data at nodes that have children (the only way that downloading to a real file system makes sense), you could just call getChildren - if there are children, its a dir, otherwise its a file. Doesn't work for empty dirs, but you could also just do getData, and if it returns null, treat it as a dir, else treat it as a file.{quote} Thanks Mark... > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
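Mark's heuristic translates into a recursive walk along these lines — a sketch that glosses over the empty-directory caveat he mentions:
{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ZkTreeDownloader {
  private final ZooKeeper zk;

  public ZkTreeDownloader(ZooKeeper zk) {
    this.zk = zk;
  }

  public void download(String zkPath, File localDir)
      throws KeeperException, InterruptedException, IOException {
    List<String> children = zk.getChildren(zkPath, false);
    byte[] data = zk.getData(zkPath, false, null);
    if (!children.isEmpty() || data == null) {
      // a node with children (or null data) is treated as a directory
      File dir = new File(localDir, name(zkPath));
      dir.mkdirs();
      for (String child : children) {
        download(zkPath + "/" + child, dir);
      }
    } else {
      // anything else is treated as a file: write the node's bytes out
      FileOutputStream out = new FileOutputStream(new File(localDir, name(zkPath)));
      try {
        out.write(data);
      } finally {
        out.close();
      }
    }
  }

  private static String name(String zkPath) {
    return zkPath.substring(zkPath.lastIndexOf('/') + 1);
  }
}
{code}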
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803943#action_12803943 ] Jason Rutherglen commented on SOLR-1724: Do we have some code that recursively downloads a tree of files from ZK? The challenge is I don't see a way to find out if a given path represents a directory or not. > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: commons-lang-2.4.jar commons-lang-2.4.jar is required > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore >Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: commons-lang-2.4.jar, SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1724: --- Attachment: SOLR-1724.patch Here's the first cut... I agree, I'm not really into ephemeral ZK nodes for Solr hosts/nodes. The reason is contact with ZK is highly superficial and can be intermittent. I'm mostly concerned with ensuring the core operations succeed on a given server. If a server goes down, there needs to be more than ZK to prove it, and if it goes down completely, I'll simply reallocate its cores to another server using the core management mechanism provided in this patch. The issue is still being worked on, specifically the Solr server portion that downloads the cores from some location, or performs operations. The file format will move to JSON. > Real Basic Core Management with Zookeeper > - > > Key: SOLR-1724 > URL: https://issues.apache.org/jira/browse/SOLR-1724 > Project: Solr > Issue Type: New Feature > Components: multicore > Affects Versions: 1.4 >Reporter: Jason Rutherglen > Fix For: 1.5 > > Attachments: SOLR-1724.patch > > > Though we're implementing cloud, I need something real soon I can > play with and deploy. So this'll be a patch that only deploys > new cores, and that's about it. The arch is real simple: > On Zookeeper there'll be a directory that contains files that > represent the state of the cores of a given set of servers which > will look like the following: > /production/cores-1.txt > /production/cores-2.txt > /production/core-host-1-actual.txt (ephemeral node per host) > Where each core-N.txt file contains: > hostname,corename,instanceDir,coredownloadpath > coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, > etc > and > core-host-actual.txt contains: > hostname,corename,instanceDir,size > Everytime a new core-N.txt file is added, the listening host > finds it's entry in the list and begins the process of trying to > match the entries. Upon completion, it updates it's > /core-host-1-actual.txt file to it's completed state or logs an error. > When all host actual files are written (without errors), then a > new core-1-actual.txt file is written which can be picked up by > another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802526#action_12802526 ] Jason Rutherglen commented on SOLR-1301: I started on the Solr wiki page for this guy... http://wiki.apache.org/solr/HadoopIndexing > Solr + Hadoop > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Andrzej Bialecki > Fix For: 1.5 > > Attachments: commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, > log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When a reduce > task completes and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801244#action_12801244 ] Jason Rutherglen commented on SOLR-1724: This'll be a patch on the cloud branch to reuse what's started. I don't see any core management code in there yet, so this looks complementary. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801216#action_12801216 ] Jason Rutherglen commented on SOLR-1724: Ted, Thanks for the Katta link. This patch will likely de-emphasize the distributed search part, which is where the ephemeral node is used (i.e. a given server lists its current state). I basically want to take care of this one little deployment aspect of cores, improving on the wacky hackedy system I'm running today. Then, if it works, I'll look at the distributed search part, hopefully in a totally separate patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801215#action_12801215 ] Jason Rutherglen commented on SOLR-1724: Note to self: I need a way to upload an empty core/confdir from the command line, basically into ZK, then reference that core from ZK (I think this'll work?). I'd rather not rely on a separate HTTP server or something... The size of a jarred-up Solr conf dir shouldn't be too much for ZK? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
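A sketch of that note-to-self, under the assumption that the jarred-up conf dir fits in a single znode; ZooKeeper's default znode size limit is roughly 1MB, so a typical conf dir should be fine, though large synonym or stopword files might not be. The paths and names here are illustrative:

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfUploader {
        // Store a jarred-up conf dir as the data of one znode.
        public static void uploadConf(ZooKeeper zk, String znode, File confJar)
                throws Exception {
            byte[] data = new byte[(int) confJar.length()];
            FileInputStream in = new FileInputStream(confJar);
            try {
                int off = 0;
                while (off < data.length) {
                    int n = in.read(data, off, data.length - off);
                    if (n < 0) break;
                    off += n;
                }
            } finally {
                in.close();
            }
            zk.create(znode, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }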
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800994#action_12800994 ] Jason Rutherglen commented on SOLR-1724: Additionally, upon successful completion of a core-version deployment to a set of nodes, a customizable deletion-policy-like mechanism will, by default, clean up the old cores on the system. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1724) Real Basic Core Management with Zookeeper
Real Basic Core Management with Zookeeper - Key: SOLR-1724 URL: https://issues.apache.org/jira/browse/SOLR-1724 Project: Solr Issue Type: New Feature Components: multicore Affects Versions: 1.4 Reporter: Jason Rutherglen Fix For: 1.5 Though we're implementing cloud, I need something real soon I can play with and deploy. So this'll be a patch that only deploys new cores, and that's about it. The arch is real simple: On Zookeeper there'll be a directory that contains files that represent the state of the cores of a given set of servers which will look like the following: /production/cores-1.txt /production/cores-2.txt /production/core-host-1-actual.txt (ephemeral node per host) Where each core-N.txt file contains: hostname,corename,instanceDir,coredownloadpath coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, etc and core-host-actual.txt contains: hostname,corename,instanceDir,size Every time a new core-N.txt file is added, the listening host finds its entry in the list and begins the process of trying to match the entries. Upon completion, it updates its /core-host-1-actual.txt file to its completed state or logs an error. When all host actual files are written (without errors), then a new core-1-actual.txt file is written which can be picked up by another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Solr Cloud wiki and branch notes
> This is really about doing not-so-much in the very near term, > while thinking ahead to the longer term. Let's have a page dedicated to release 1.0 of cloud? I feel uncomfortable editing the existing wiki because I don't know what the plans are for the first release. I need to revisit Katta, as my short-term plans include using Zookeeper not for failover, but simply for deploying shards/cores to servers, and nothing else. I can use the core admin interface to bring them online, update them, etc. Or I'll just implement something and make a patch to Solr... Thinking out loud: /anyname/shardlist-v1.txt /anyname/shardlist-v2.txt where shardlist-v1.txt contains: corename,coredownloadpath,instanceDir Where coredownloadpath can be any URL including hftp, hdfs, ftp, http, https. Where the system automagically uninstalls cores that should no longer exist on a given server. Cores with the same name deployed to the same server would use the reload command, otherwise the create command. Where there's a ZK listener on the /anyname directory for new files that are greater than the last known installed shardlist.txt. Alternatively, an even simpler design would be uploading a solr.xml file per server, something like: /anyname/solr-prod01.solr.xml Which a directory listener on each server parses and makes the necessary changes (without restarting Tomcat). On the search side in this system, I'd need to wait for the cores to complete their install, then swap in a new core on the search proxy that represents the new version of the corelist, then the old cores could go away. This isn't very different from the segmentinfos system used in Lucene, IMO. On Fri, Jan 15, 2010 at 1:53 PM, Yonik Seeley wrote: > On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen > wrote: >> The page is huge, which signals to me maybe we're trying to do >> too much > > This is really about doing not-so-much in the very near term, while > thinking ahead to the longer term. > >> Revamping distributed search could be in a different branch >> (this includes partial results) > > That could just be a separate patch - its scope is not that broad (I > think there may already be a JIRA issue open for it). > >> Having a single solrconfig and schema for each core/shard in a >> collection won't work for me. I need to define each core >> externally, and I don't want Solr-Cloud to manage this, how will >> this scenario work? > > We do plan on each core being able to have its own schema (so one > could try out a version of a schema and gradually migrate the > cluster). > > It could also be possible to define a schema as "local" (i.e. use the > one on the local file system) > >> A host is about the same as node, I don't see the difference, or >> enough of one > > A host is the hardware. It will have limited disk, limited CPU, etc. > At some point we will want to model this... multiple nodes could be > launched on one box. We're not doing anything with it now, and won't > in the near future. > >> Cluster resizing and rebalancing can and should be built >> externally and hopefully after an initial release that does the >> basics well > > The initial release will certainly not be doing any resizing or rebalancing. > We should allow this to be done externally. In the future, we > shouldn't require that this be done externally though (i.e. we should > somehow allow the cluster to grow w/o people having to write code). > >> Collection is a group of cores? > > A collection of documents - the complete search index. It has a > single schema, etc.
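A sketch of the proposed /anyname listener, assuming shardlist files are named shardlist-v<N>.txt under a single directory; the re-list on change and the install step are stubbed out, and the class is hypothetical:

    import java.util.List;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ShardListWatcher {
        // Find the newest shardlist-v<N>.txt under dir; a real implementation
        // would re-register the watch, then fetch the file and apply the
        // create/reload commands when newest > lastInstalled.
        public static int findNewest(ZooKeeper zk, String dir, int lastInstalled)
                throws Exception {
            List<String> names = zk.getChildren(dir, new Watcher() {
                public void process(WatchedEvent event) {
                    // child change: re-list in a real implementation
                }
            });
            int newest = lastInstalled;
            for (String name : names) {
                if (name.startsWith("shardlist-v") && name.endsWith(".txt")) {
                    int v = Integer.parseInt(
                        name.substring("shardlist-v".length(), name.length() - 4));
                    if (v > newest) newest = v;
                }
            }
            return newest;
        }
    }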
> > -Yonik > http://www.lucidimagination.com >
Solr Cloud wiki and branch notes
Here's some rough notes after running the unit tests, reviewing some of the code (though not understanding it), and reviewing the wiki page http://wiki.apache.org/solr/SolrCloud We need a protocol in the URL, otherwise it's inflexible. I'm overwhelmed by all the ?? question areas of the document. The page is huge, which signals to me maybe we're trying to do too much. Revamping distributed search could be in a different branch (this includes partial results). Having a single solrconfig and schema for each core/shard in a collection won't work for me. I need to define each core externally, and I don't want Solr-Cloud to manage this; how will this scenario work? A host is about the same as a node; I don't see the difference, or enough of one. Cluster resizing and rebalancing can and should be built externally, and hopefully after an initial release that does the basics well. Collection is a group of cores? I like the model -> reality system. However, how does the versioning work? How will we know what the conversion progress is? How will the queuing of in-progress alterations work? (This seems hard; I'd rather focus on this and make it work well than mess with other things like load balancing in the first release. I.e. if this doesn't work well, Solr-Cloud isn't production-ready for me.) Shard identification falls under "too ambitious" right now, IMO. I think we need a wiki page of just the basics of core/shard management, implement that, then build all the rest of the features on top... Otherwise this thing feels like it's going to be a nightmare to test and deploy in production.
Re: [jira] Commented: (SOLR-1301) Solr + Hadoop
Copying files a la HDFS is trivial because it's sequential; Lucene merging isn't, so scaling merging over 20 machines vs. 4 Solr masters has clear advantages... That, and on-demand expandability: reindexing 2 terabytes of data in half a day vs. weeks or more with 4 Solr masters is a compelling advantage. On Fri, Jan 15, 2010 at 12:09 PM, Grant Ingersoll wrote: > I can see why that is a win over the existing, but I still don't get why it > wouldn't be faster just to index to a suite of Solr master indexers and save > all this file slogging around. But, I guess that is a separate patch all > together. > > > > On Jan 15, 2010, at 2:35 PM, Jason Rutherglen wrote: > >> Zipping cores/shards is in the latest patch... >> >> On Fri, Jan 15, 2010 at 11:22 AM, Andrzej Bialecki wrote: >>> On 2010-01-15 20:13, Ted Dunning wrote: >>>> >>>> This can also be a big performance win. Jason Venner reports significant >>>> index and cluster start time improvements by indexing to local disk, >>>> zipping >>>> and then uploading the resulting zip file. Hadoop has significant file >>>> open >>>> overhead so moving one zip file wins big over many index component files. >>>> There is a secondary bandwidth win as well. >>> >>> Indeed, this one should be easy to add to this patch. Unless Jason & Jason >>> already cooked a patch for this? ;) >>> >>>> >>>> On Fri, Jan 15, 2010 at 8:34 AM, Andrzej Bialecki >>>> (JIRA)wrote: >>>> >>>>> >>>>> HDFS doesn't support enough POSIX to support writing Lucene indexes >>>>> directly to HDFS - for this reason indexes are always created on local >>>>> storage of each node, and then after closing they are copied to HDFS. >>> >>> >>> >>> >>> -- >>> Best regards, >>> Andrzej Bialecki <>< >>> ___. ___ ___ ___ _ _ __ >>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web >>> ___|||__|| \| || | Embedded Unix, System Integration >>> http://www.sigram.com Contact: info at sigram dot com >>> >>> > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > >
Re: [jira] Commented: (SOLR-1301) Solr + Hadoop
Zipping cores/shards is in the latest patch... On Fri, Jan 15, 2010 at 11:22 AM, Andrzej Bialecki wrote: > On 2010-01-15 20:13, Ted Dunning wrote: >> >> This can also be a big performance win. Jason Venner reports significant >> index and cluster start time improvements by indexing to local disk, >> zipping >> and then uploading the resulting zip file. Hadoop has significant file >> open >> overhead so moving one zip file wins big over many index component files. >> There is a secondary bandwidth win as well. > > Indeed, this one should be easy to add to this patch. Unless Jason & Jason > already cooked a patch for this? ;) > >> >> On Fri, Jan 15, 2010 at 8:34 AM, Andrzej Bialecki >> (JIRA)wrote: >> >>> >>> HDFS doesn't support enough POSIX to support writing Lucene indexes >>> directly to HDFS - for this reason indexes are always created on local >>> storage of each node, and then after closing they are copied to HDFS. > > > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > >
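For reference, a sketch of the zip-one-core idea Ted describes, which trades many small HDFS file opens for a single archive; recursion is skipped since a Lucene index directory is flat. Names are illustrative, not the patch's actual code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class IndexZipper {
        // Zip a (flat) Lucene index directory into one file for upload.
        public static void zipIndexDir(File indexDir, File zipFile) throws IOException {
            File[] files = indexDir.listFiles();
            if (files == null) throw new IOException("not a directory: " + indexDir);
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zipFile));
            try {
                byte[] buf = new byte[8192];
                for (File f : files) {
                    if (f.isDirectory()) continue; // index dirs are flat
                    out.putNextEntry(new ZipEntry(f.getName()));
                    InputStream in = new FileInputStream(f);
                    try {
                        int n;
                        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
                    } finally {
                        in.close();
                    }
                    out.closeEntry();
                }
            } finally {
                out.close();
            }
        }
    }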
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800802#action_12800802 ] Jason Rutherglen commented on SOLR-1301: bq. Hadoop streaming the output of the reduce tasks to the Solr indexing servers. Yes, this is what we've implemented; it's just normal Solr HTTP-based indexing, right? It works well to a limited degree, and given the particular implementation details, there are reasons why this can be less than ideal. The balanced, distributed shards/cores system works far better and enables us to use less hardware (but I'm not going into all the details here). One issue I can mention is the switchover to a new set of incremental servers (which happens when the old servers fill up); I'm looking to automate this, and will likely focus on it and the core management in the cloud branch.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800775#action_12800775 ] Jason Rutherglen commented on SOLR-1301: {quote}What I meant was the Hadoop job could simply know what the set of master indexers are and send the documents directly to them{quote} One can use Hadoop for this purpose; we have implemented the system this way for the incremental indexes, however it doesn't require a separate patch or contrib module. The problem with the Hadoop streaming model is that it doesn't scale well if, for example, we need to reindex using the CJKAnalyzer, or using Basis' analyzer, etc. We use SOLR-1301 for reindexing loads of data, as fast as possible, by parallelizing the indexing. There are lots of little things I'd like to add to the functionality; implementing ZK-based core management takes higher priority, though, as I spend a lot of time doing this manually today. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800756#action_12800756 ] Jason Rutherglen commented on SOLR-1301: Andrzej's model works great in production. We have both 1) master -> slave for incremental updates, and 2) indexing in Hadoop with this patch, after which we deploy each new core/shard in a balanced fashion to many servers. They're two separate modalities. The ZK stuff (as it's modeled today) isn't useful here, because I want the schema I indexed with to be part of the zip file stored in HDFS (or S3, or wherever). Any sort of ZK thingy is good for managing the cores/shards across many servers, however Katta does this already (so we're reinventing the same thing, which isn't necessarily a bad thing if we also have a clear path for incremental indexing, as discussed above). Ultimately, the Solr server can be viewed as simply a container for cores, and the cloud + ZK branch as a manager of cores/shards. Anything more ambitious will probably be overkill, and this is what I believe Ted has been trying to get at. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
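A sketch of the balanced deployment mentioned in the comment, interpreted as plain size-based bin packing: assign the largest cores first, each to the currently least-loaded server. This is a greedy heuristic for illustration, not the actual deployment code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Balancer {
        // Map each server to the list of core names assigned to it.
        public static Map<String, List<String>> balance(Map<String, Long> coreSizes,
                                                        List<String> servers) {
            Map<String, Long> load = new HashMap<String, Long>();
            Map<String, List<String>> assignment = new HashMap<String, List<String>>();
            for (String s : servers) {
                load.put(s, 0L);
                assignment.put(s, new ArrayList<String>());
            }
            List<Map.Entry<String, Long>> cores =
                new ArrayList<Map.Entry<String, Long>>(coreSizes.entrySet());
            Collections.sort(cores, new Comparator<Map.Entry<String, Long>>() {
                public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                    return b.getValue().compareTo(a.getValue()); // largest first
                }
            });
            for (Map.Entry<String, Long> core : cores) {
                String best = servers.get(0);
                for (String s : servers) {
                    if (load.get(s) < load.get(best)) best = s;
                }
                assignment.get(best).add(core.getKey());
                load.put(best, load.get(best) + core.getValue());
            }
            return assignment;
        }
    }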
Re: SolrCloud logical shards
> The point I was trying to make is that I believe that if you start changing > terminologies now people will be very confused So shard -> remote core... Slice -> core group. Though semantically they're synonyms. In any case, I need to spend some time looking at the cloud branch, and less time jibber-jabberin' about it. On Fri, Jan 15, 2010 at 1:24 AM, Uri Boness wrote: >> >> Can you elaborate on what you mean, isn't a core a single index >> too? It seems like shard was used to represent a remote index >> (perhaps?). > > Yes, a core is a single index and a shard is a conceptual idea which at the > moment concretely refers to a remote core (but not a specific one as the > same shard can be represented by multiple core replicas). The point I was > trying to make is that I believe that if you start changing terminologies > now people will be very confused. And I thought of sticking to Yonik's > suggestion of a "slice" just to prevent this confusion. On the other hand > one can argue that the terminology as it is today is already confusing... > and if you really want to get it right and be aligned with the "rest of the > world" (if there is such a thing... from what I've seen so far sharding is > used differently in different contexts), then perhaps a "good" timing for > making such terminology changes is with a major release (Solr 2.0?) as with > such release people tend to be more open for new/changed concepts. > > Cheers, > Uri > > Jason Rutherglen wrote: >> >> Uri, >> >> >>> >>> "core" to represent a single index and "shard" to be >>> represented by a single core >>> >> >> Can you elaborate on what you mean, isn't a core a single index >> too? It seems like shard was used to represent a remote index >> (perhaps?). Though here I'd prefer "remote core", because to the >> uninitiated Solr outsider it's immediately obvious (i.e. they >> need only know what a core is, in the Solr glossary or term >> dictionary). >> >> In Google vernacular, which is where the name shard came from, a >> "shard" is basically a local sub-index >> http://research.google.com/archive/googlecluster.html where >> there would be many "shards" per server. However that's a >> digression at this point. >> >> I personally prefer relatively straightforward names, that are >> self-evident, rather than inventing new language for fairly >> simple concepts. Slice, even though it comes from our buddy >> Yonik, probably doesn't make any immediate sense to external >> users when compared with the word shard. Of course software >> projects have a tendency to create their own words to somewhat >> mystify users into believing in some sort of magic occurring >> underneath. If that's what we're after, it's cool, I mean that >> makes sense. And I don't mean to be derogatory here however this >> is an open source project created in part to educate users on >> search and be made easily accessible as possible, to the >> greatest number of users possible. I think Doug did a create job >> of this when Lucene started with amazingly succinct code for >> fairly complex concepts (eg, anti-mystification of search). >> >> Jason >> >> On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness wrote: >> >>> >>> Although Jason has some valid points here, I'm with Yonik here. I do >>> believe >>> that we've gotten used to the terms "core" to represent a single index >>> and >>> "shard" to be represented by a single core. A "node" seems to indicate a >>> machine or a JVM. Changing any of these (informal perhaps) definitions >>> will >>> only cause confusion. 
That's why I think a "slice" is a good solution >>> now... >>> first it's a new term to a new view of the index (logical shard AFAIK >>> don't >>> really exists yet) so people won't need to get used to it, but it's also >>> descriptive and intuitive. I do like Jason's idea about having a protocol >>> attached to the URL's. >>> >>> Cheers, >>> Uri >>> >>> Jason Rutherglen wrote: >>> >>>>> >>>>> But I've kind of gotten used to thinking of shards as the >>>>> actual physical queryable things... >>>>> >>>>> >>>> >>>> I think a mistake was made referring to Solr cores
Re: SolrCloud logical shards
Uri, > "core" to represent a single index and "shard" to be > represented by a single core Can you elaborate on what you mean? Isn't a core a single index too? It seems like shard was used to represent a remote index (perhaps?). Though here I'd prefer "remote core", because to the uninitiated Solr outsider it's immediately obvious (i.e. they need only know what a core is, in the Solr glossary or term dictionary). In Google vernacular, which is where the name shard came from, a "shard" is basically a local sub-index http://research.google.com/archive/googlecluster.html where there would be many "shards" per server. However that's a digression at this point. I personally prefer relatively straightforward names that are self-evident, rather than inventing new language for fairly simple concepts. Slice, even though it comes from our buddy Yonik, probably doesn't make any immediate sense to external users when compared with the word shard. Of course software projects have a tendency to create their own words to somewhat mystify users into believing in some sort of magic occurring underneath. If that's what we're after, it's cool, I mean that makes sense. And I don't mean to be derogatory here; however, this is an open source project created in part to educate users on search and be made as easily accessible as possible, to the greatest number of users possible. I think Doug did a great job of this when Lucene started with amazingly succinct code for fairly complex concepts (e.g., anti-mystification of search). Jason On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness wrote: > Although Jason has some valid points here, I'm with Yonik here. I do believe > that we've gotten used to the terms "core" to represent a single index and > "shard" to be represented by a single core. A "node" seems to indicate a > machine or a JVM. Changing any of these (informal perhaps) definitions will > only cause confusion. That's why I think a "slice" is a good solution now... > first it's a new term to a new view of the index (logical shard AFAIK don't > really exists yet) so people won't need to get used to it, but it's also > descriptive and intuitive. I do like Jason's idea about having a protocol > attached to the URL's. > > Cheers, > Uri > > Jason Rutherglen wrote: >>> >>> But I've kind of gotten used to thinking of shards as the >>> actual physical queryable things... >>> >> >> I think a mistake was made referring to Solr cores as shards. >> It's the same thing with 2 different names. Slices adds yet >> another name which seems to imply the same thing yet again. I'd >> rather see disambiguation here, and call them cores (partially >> because that's what's in the code and on the wiki), and cores >> only. It's a Solr specific term, it's going to be confused with >> microprocessor cores, but at least there's only one name, which >> as search people, we know creates fewer posting lists :). >> >> Logical groupings of cores can occur, which can be aptly named >> core groups. This way I can submit a query to a core group, and >> it's reasonable to assume I'm hitting N cores. Further, cores >> could point to a logical or physical entity via a URL. (As a >> side note, I've always found it odd that the shards param to >> RequestHandler lacks the protocol, what if I want to use HTTPS >> for example?). 
>> >> So there could be http://host/solr/core1 (physical), >> core://megacorename (logical), >> coregroup://supergreatcoregroupname (a group of cores) in the >> "shards" parameter (whose name can perhaps be changed for >> clarity in a future release). Then people can mix and match and >> we won't have many different XML elements floating around. We'd >> have a simple list of URLs that are transposed into a real >> physical network request. >> >> >> On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley >> wrote: >> >>> >>> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley >>> wrote: >>> >>>> >>>> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley >>>> wrote: >>>> >>>>> >>>>> I'm actually starting to lean toward "slice" instead of "logical >>>>> shard". >>>>> >>> >>> Alternate terminology could be "index" for the actual physical lucene >>> lindex (and also enough of the URL that unambiguously identifies it), >>> and then "shard" could be the logical entity. >>> >>> But I've kind of gotten used to thinking of shards as the actual >>> physical queryable things... >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >> >> >
Re: SolrCloud logical shards
> But I've kind of gotten used to thinking of shards as the > actual physical queryable things... I think a mistake was made referring to Solr cores as shards. It's the same thing with 2 different names. "Slice" adds yet another name which seems to imply the same thing yet again. I'd rather see disambiguation here, and call them cores (partially because that's what's in the code and on the wiki), and cores only. It's a Solr-specific term, it's going to be confused with microprocessor cores, but at least there's only one name, which as search people, we know creates fewer posting lists :). Logical groupings of cores can occur, which can be aptly named core groups. This way I can submit a query to a core group, and it's reasonable to assume I'm hitting N cores. Further, cores could point to a logical or physical entity via a URL. (As a side note, I've always found it odd that the shards param to RequestHandler lacks the protocol; what if I want to use HTTPS, for example?) So there could be http://host/solr/core1 (physical), core://megacorename (logical), coregroup://supergreatcoregroupname (a group of cores) in the "shards" parameter (whose name can perhaps be changed for clarity in a future release). Then people can mix and match and we won't have many different XML elements floating around. We'd have a simple list of URLs that are transposed into a real physical network request. On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley wrote: > On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley > wrote: >> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley >> wrote: >>> I'm actually starting to lean toward "slice" instead of "logical shard". > > Alternate terminology could be "index" for the actual physical lucene > index (and also enough of the URL that unambiguously identifies it), > and then "shard" could be the logical entity. > > But I've kind of gotten used to thinking of shards as the actual > physical queryable things... > > -Yonik > http://www.lucidimagination.com >
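A sketch of the mix-and-match resolution the message proposes, where a shards entry may be a plain physical URL, a core:// logical name, or a coregroup:// name; the lookup maps are hypothetical stand-ins for whatever ZK or local config would provide:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class ShardResolver {
        // Expand shard specs into physical URLs. Plain URLs pass through;
        // core:// and coregroup:// names resolve through the given maps.
        public static List<String> resolve(List<String> specs,
                                           Map<String, String> logicalCores,
                                           Map<String, List<String>> coreGroups) {
            List<String> physical = new ArrayList<String>();
            for (String spec : specs) {
                if (spec.startsWith("core://")) {
                    physical.add(logicalCores.get(spec.substring("core://".length())));
                } else if (spec.startsWith("coregroup://")) {
                    physical.addAll(coreGroups.get(spec.substring("coregroup://".length())));
                } else {
                    physical.add(spec); // e.g. http://host/solr/core1 or https://...
                }
            }
            return physical;
        }
    }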
[jira] Commented: (SOLR-1720) replication configuration bug with multiple replicateAfter values
[ https://issues.apache.org/jira/browse/SOLR-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799843#action_12799843 ] Jason Rutherglen commented on SOLR-1720: For consistency, maybe we should support comma-delimited lists? I edit the shards parameter a lot (comma-delimited), which could use separate elements as well, so by rote I just used commas for this, because it seemed like a Solr standard... Thanks for clarifying! > replication configuration bug with multiple replicateAfter values > - > > Key: SOLR-1720 > URL: https://issues.apache.org/jira/browse/SOLR-1720 > Project: Solr > Issue Type: Bug >Affects Versions: 1.4 >Reporter: Yonik Seeley > Fix For: 1.5 > > > Jason reported problems with multiple replicateAfter values - it worked after > changing to just "commit" > http://www.lucidimagination.com/search/document/e4c9ba46dc03b031/replication_problem -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1709) Distributed Date Faceting
[ https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797898#action_12797898 ] Jason Rutherglen commented on SOLR-1709: Tim, Thanks for the patch... bq. as I'm having a bit of trouble with svn (don't shoot me, but my environment is a Redmond-based os company). TortoiseSVN works well on Windows, even for creating patches. Have you tried it? > Distributed Date Faceting > - > > Key: SOLR-1709 > URL: https://issues.apache.org/jira/browse/SOLR-1709 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other >Affects Versions: 1.4 >Reporter: Peter Sturge >Priority: Minor > > This patch is for adding support for date facets when using distributed > searches. > Date faceting across multiple machines exposes some time-based issues that > anyone interested in this behaviour should be aware of: > Any time and/or time-zone differences are not accounted for in the patch > (i.e. merged date facets are at a time-of-day, not necessarily at a universal > 'instant-in-time', unless all shards are time-synced to the exact same time). > The implementation uses the first encountered shard's facet_dates as the > basis for subsequent shards' data to be merged in. > This means that if subsequent shards' facet_dates are skewed in relation to > the first by >1 'gap', these 'earlier' or 'later' facets will not be merged > in. > There are several reasons for this: > * Performance: It's faster to check facet_date lists against a single map's > data, rather than against each other, particularly if there are many shards > * If 'earlier' and/or 'later' facet_dates are added in, this will make the > time range larger than that which was requested > (e.g. a request for one hour's worth of facets could bring back 2, 3 > or more hours of data) > This could be dealt with if timezone and skew information was added, and > the dates were normalized. > One possibility for adding such support is to [optionally] add 'timezone' and > 'now' parameters to the 'facet_dates' map. This would tell requesters what > time and TZ the remote server thinks it is, and so multiple shards' time data > can be normalized. > The patch affects 2 files in the Solr core: > org.apache.solr.handler.component.FacetComponent.java > org.apache.solr.handler.component.ResponseBuilder.java > The main changes are in FacetComponent - ResponseBuilder is just to hold the > completed SimpleOrderedMap until the finishStage. > One possible enhancement is to perhaps make this an optional parameter, but > really, if facet.date parameters are specified, it is assumed they are > desired. > Comments & suggestions welcome. > As a favour to ask, if anyone could take my 2 source files and create a PATCH > file from it, it would be greatly appreciated, as I'm having a bit of trouble > with svn (don't shoot me, but my environment is a Redmond-based os company). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
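A sketch of the merge rule the description lays out: the first shard's facet_dates map is the basis, matching buckets from later shards are summed in, and buckets outside the first shard's range are dropped. The real patch works on Solr's NamedList inside FacetComponent; a plain map is used here for brevity:

    import java.util.Map;

    public class DateFacetMerger {
        // Merge one shard's facet_dates counts into the base (first shard's) map.
        public static void merge(Map<String, Integer> base, Map<String, Integer> shard) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                Integer existing = base.get(e.getKey());
                if (existing != null) {
                    base.put(e.getKey(), existing + e.getValue());
                }
                // else: a bucket 'earlier' or 'later' than the first shard's
                // range; per the description it is not merged in
            }
        }
    }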
[jira] Commented: (SOLR-1665) Add &debugTimings param so that timings for components can be retrieved without having to do explains(), as in &debugQuery
[ https://issues.apache.org/jira/browse/SOLR-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793474#action_12793474 ] Jason Rutherglen commented on SOLR-1665: Plus one, visibility into the components would be good. Will this work for distributed processes (i.e. the time taken on each node per component)? > Add &debugTimings param so that timings for components can be retrieved > without having to do explains(), as in &debugQuery > -- > > Key: SOLR-1665 > URL: https://issues.apache.org/jira/browse/SOLR-1665 > Project: Solr > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 1.5 > > > As the title says, it would be great if we could just get back component > timings w/o having to do the full boat of explains and other stuff. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)
[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793358#action_12793358 ] Jason Rutherglen commented on SOLR-1277: {quote}Zookeeper gives us the layout of the cluster. It doesn't seem like we need (yet) fast failure detection from zookeeper - other nodes can do this synchronously themselves (and would need to anyway) on things like connection failures. App-level timeouts should not mark the node as failed since we don't know how long the request was supposed to take.{quote} Google Chubby, when used in conjunction with search, sets a high timeout of 60 seconds, I believe? Fast failover is difficult, so it'll be best to enable fast re-requesting to adjacent slave servers on request failure. Mahadev has some good advice about how we can separate the logic into different znodes. Going further, I think we'll want to allow cores to register themselves, then listen to a separate directory as to what state each should be in. We'll need to ensure the architecture allows for defining multiple tiers (like a pyramid). At http://wiki.apache.org/solr/ZooKeeperIntegration is a node a core or a server/corecontainer? To move ahead we'll really need to define and settle on the directory and file structure. I believe the requirement of grouping cores, so that one may issue a search against a group name instead of individual shard names, will be useful. The ability to move cores to different nodes will be necessary, as is the ability to replicate cores (i.e. have multiple copies available on different servers). Today I deploy lots of cores from HDFS across quite a few servers containing 1.6 billion documents representing at least 2.4 TB of data. I mention this because a lot can potentially go wrong in this type of setup (i.e. servers going down, corrupted data, intermittent networks, etc.) I generate a file that contains all the information as to which core should go to which Solr server using size-based balancing. Ideally I'd be able to generate a new file, perhaps for load balancing the cores across new Solr servers or to define that hot cores should be replicated, and the Solr cluster would move the cores to the defined servers automatically. This doesn't include the separate set of servers that handles incremental updates (i.e. master -> slave). There's a bit of trepidation in moving forward on this because we don't want to engineer ourselves into a hole; however, if we need to change the structure of the znodes in the future, we'll need a healthy versioning plan such that one may upgrade a cluster while maintaining backwards compatibility on a live system. Let's think of a basic plan for this. In conclusion, let's iterate on the directory structure via the wiki or this issue? {quote}A search node can have very large caches tied to readers that all drop at once on commit, and can require a much larger heap to accommodate these caches. I think thats a more common scenario that creates these longer pauses.{quote} The large cache issue should be fixable with the various NRT changes in SOLR-1606. They're collectively not much different from the search and sort per segment changes made to Lucene 2.9. 
> Implement a Solr specific naming service (using Zookeeper) > -- > > Key: SOLR-1277 > URL: https://issues.apache.org/jira/browse/SOLR-1277 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 1.5 > > Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, > SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar > > Original Estimate: 672h > Remaining Estimate: 672h > > The goal is to give Solr server clusters self-healing attributes > where if a server fails, indexing and searching don't stop and > all of the partitions remain searchable. For configuration, the > ability to centrally deploy a new configuration without servers > going offline. > We can start with basic failover and go from there? > Features: > * Automatic failover (i.e. when a server fails, clients stop > trying to index to or search it) > * Centralized configuration management (i.e. new solrconfig.xml > or schema.xml propagates to a live Solr cluster) > * Optionally allow shards of a partition to be moved to another > server (i.e. if a server gets hot, move the hot segments out to > cooler servers). Ideally we'd have a way to detect hot segments > and move them seamlessly. With NRT this becomes somewhat more > difficult but not impossible? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
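A sketch of the register-then-listen idea from the comment above, with illustrative znode paths; whether registration should be ephemeral or persistent is exactly the open question raised in SOLR-1724, so the mode below is just one possible choice:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CoreRegistrar {
        // Register a core under a "registered" directory, then read the
        // desired state written elsewhere by the deployer. The paths are
        // illustrative, not a settled layout.
        public static void registerCore(ZooKeeper zk, String host, String core)
                throws Exception {
            String node = "/cores/registered/" + host + "_" + core;
            zk.create(node, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
            // watch the desired-state node so changes trigger a comparison
            byte[] desired = zk.getData("/cores/state/" + host + "_" + core, true, null);
            // compare desired vs. actual core state and act on any difference
        }
    }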
[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)
[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791528#action_12791528 ] Jason Rutherglen commented on SOLR-1277: bq. as two types of failures, possibly A failure is a failure, and whether it's the GC or something else, it's really the same thing. Sounds like we're defining the expectations for client handling of a failure? I think we'll need to define groups of shards (maybe this is already in the spec) and allow a configurable failure setting per group. For example, group "live" would be allowed to return partial results because the user always wants results returned quickly. Group "archive" would always return complete results (if a node is down, it can be configured to retry the request N times until it succeeds, under a given max timeout). Also, a request could be addressed to a group of shards, which would allow one set of replicated Zookeeper servers for N Solr clusters (instead of a Zookeeper server per Solr cluster). How are we addressing a failed connection to a slave server, i.e. instead of failing the request, re-making the request to an adjacent slave? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
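A sketch of re-requesting against an adjacent slave on connection failure, as asked above; the Replica interface is a hypothetical stand-in for however the request would actually be issued:

    import java.util.List;

    public class FailoverClient {
        public interface Replica<T> {
            T execute() throws Exception;
        }

        // Try replicas round-robin until one succeeds or maxTries is spent.
        public static <T> T requestWithFailover(List<Replica<T>> replicas, int maxTries)
                throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt < maxTries; attempt++) {
                try {
                    return replicas.get(attempt % replicas.size()).execute();
                } catch (Exception e) {
                    last = e; // connection failed: move on to the adjacent slave
                }
            }
            throw last != null ? last : new IllegalArgumentException("maxTries < 1");
        }
    }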
[jira] Commented: (SOLR-1506) Search multiple cores using MultiReader
[ https://issues.apache.org/jira/browse/SOLR-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789600#action_12789600 ] Jason Rutherglen commented on SOLR-1506: There's a different bug here: because CoreContainer loads the cores sequentially and MultiCoreReaderFactory looks for all the cores, when the proxy core isn't last, not all the cores are searchable; if the proxy is first, an exception is thrown. The workaround is to place the proxy core last, however that's not possible when using the core admin HTTP API. Hmm... Not sure what the best workaround is. > Search multiple cores using MultiReader > --- > > Key: SOLR-1506 > URL: https://issues.apache.org/jira/browse/SOLR-1506 > Project: Solr > Issue Type: Improvement > Components: search > Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Trivial > Fix For: 1.5 > > Attachments: SOLR-1506.patch, SOLR-1506.patch, SOLR-1506.patch > > > I need to search over multiple cores, and SOLR-1477 is more > complicated than expected, so here we'll create a MultiReader > over the cores to allow searching on them. > Maybe in the future we can add parallel searching however > SOLR-1477, if it gets completed, provides that out of the box. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
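The central idea of the patch, sketched minimally: wrap each member core's IndexReader in a single Lucene MultiReader so one searcher sees all of them (this constructor exists in Lucene 2.9); obtaining the sub-readers from the CoreContainer is elided:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;

    public class MultiCoreReader {
        // closeSubReaders=false: each owning core keeps managing its reader.
        public static IndexReader openAcrossCores(IndexReader[] coreReaders) {
            return new MultiReader(coreReaders, false);
        }
    }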
[jira] Commented: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787800#action_12787800 ] Jason Rutherglen commented on SOLR-1606: The current NRT IndexWriter.getReader API cannot yet support IndexReaderFactory; I'll open a Lucene issue. > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787686#action_12787686 ] Jason Rutherglen commented on SOLR-1606: I was going to start on the auto-warming using IndexWriter's IndexReaderWarmer; however, because this is heavily cache dependent, I think it'll have to wait for SOLR-1308, since we need to regenerate the cache per reader. > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787621#action_12787621 ] Jason Rutherglen commented on SOLR-1606: {quote}For example, q=foo&freshness=1000 would cause a new realtime reader to be opened if the current one was more than 1000ms old.{quote} Good idea. > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
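A minimal sketch of the freshness idea, assuming the Lucene 2.9 IndexWriter.getReader NRT API; the holder class and its names are illustrative only.

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;

  class RealtimeReaderHolder {
    private final IndexWriter writer;
    private IndexReader reader;
    private long openedAtMs;

    RealtimeReaderHolder(IndexWriter writer) throws IOException {
      this.writer = writer;
      this.reader = writer.getReader();   // near-realtime reader
      this.openedAtMs = System.currentTimeMillis();
    }

    // Reopen the realtime reader only if the current one is older than
    // the requested freshness bound (e.g. freshness=1000).
    synchronized IndexReader get(long freshnessMs) throws IOException {
      if (System.currentTimeMillis() - openedAtMs > freshnessMs) {
        IndexReader newReader = writer.getReader();
        reader.close();   // simplified; real code must reference-count readers
        reader = newReader;
        openedAtMs = System.currentTimeMillis();
      }
      return reader;
    }
  }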
[jira] Commented: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787619#action_12787619 ] Jason Rutherglen commented on SOLR-1606: {quote}In any case, I assume it must not fsync the files, so you don't get a commit where you know your in a stable condition?{quote} OK, right: for the user, commit currently means that after the call the index is in a stable state and can be replicated? I agree; for clarity, I'll create a refresh command and remove the NRT option from the commit command. > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787221#action_12787221 ] Jason Rutherglen commented on SOLR-1606: bq. Don't we need a new command, like update_realtime We could, though it would work the same as commit? Meaning that afterwards, all pending changes (including deletes) are available? The commit command is fairly overloaded as it is. Are you thinking in terms of replication? > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787206#action_12787206 ] Jason Rutherglen commented on SOLR-1606: Koji, it looks like a change to trunk is causing the error. Also, when I step through it, the test passes; when I run without stepping, it fails... > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-433) MultiCore and SpellChecker replication
[ https://issues.apache.org/jira/browse/SOLR-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787155#action_12787155 ] Jason Rutherglen commented on SOLR-433: --- Are the existing patches for multiple cores or only for spellchecking? > MultiCore and SpellChecker replication > -- > > Key: SOLR-433 > URL: https://issues.apache.org/jira/browse/SOLR-433 > Project: Solr > Issue Type: Improvement > Components: replication (scripts), spellchecker >Affects Versions: 1.3 >Reporter: Otis Gospodnetic > Fix For: 1.5 > > Attachments: RunExecutableListener.patch, SOLR-433-r698590.patch, > SOLR-433.patch, SOLR-433.patch, SOLR-433.patch, SOLR-433.patch, > solr-433.patch, SOLR-433_unified.patch, spellindexfix.patch > > > With MultiCore functionality coming along, it looks like we'll need to be > able to: > A) snapshot each core's index directory, and > B) replicate any and all cores' complete data directories, not just their > index directories. > Pulled from the "spellchecker and multi-core index replication" thread - > http://markmail.org/message/pj2rjzegifd6zm7m > Otis: > I think that makes sense - distribute everything for a given core, not just > its index. And the spellchecker could then also have its data dir (and only > index/ underneath really) and be replicated in the same fashion. > Right? > Ryan: > Yes, that was my thought. If an arbitrary directory could be distributed, > then you could have > /path/to/dist/index/... > /path/to/dist/spelling-index/... > /path/to/dist/foo > and that would all get put into a snapshot. This would also let you put > multiple cores within a single distribution: > /path/to/dist/core0/index/... > /path/to/dist/core0/spelling-index/... > /path/to/dist/core0/foo > /path/to/dist/core1/index/... > /path/to/dist/core1/spelling-index/... > /path/to/dist/core1/foo -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1308) Cache docsets at the SegmentReader level
[ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786240#action_12786240 ] Jason Rutherglen commented on SOLR-1308: {quote} Yeah... that's a pain. We could easily do per-segment faceting for non-string types though (int, long, etc) since they don't need to be merged. {quote} I opened SOLR-1617 for this. I think doc sets can be handled with a multi doc set (hopefully). Facets, however, argh: FacetComponent is really hairy, though I think it boils down to simply adding up the counts for identical field values? Then there seem to be edge cases which I'm scared of. At least it's easy to test whether we're fulfilling today's functionality by randomly unit testing per-segment and multi-segment side by side (i.e. if the results of one are different from the results of the other, we know there's something to fix). Perhaps we can initially add up field values, test that (which is enough for my project), and move on from there. I'd still like to genericize all of the distributed processes to work over multiple segments (like Lucene distributed search uses a MultiSearcher, which also works locally), so that local or distributed is the same API-wise. However, I've had trouble figuring out the existing distributed code (SOLR-1477 ran into a wall). Maybe as part of SolrCloud http://wiki.apache.org/solr/SolrCloud, we can rework the distributed APIs to be more user friendly (i.e. *MultiSearcher is really easy to understand). If Solr's going to work well in the cloud, distributed search probably needs to be easy to multi-tier for scaling (i.e. if we have 1 proxy server and 100 nodes, we could have 1 top proxy, and 1 proxy per 10 nodes, etc). > Cache docsets at the SegmentReader level > > > Key: SOLR-1308 > URL: https://issues.apache.org/jira/browse/SOLR-1308 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Original Estimate: 504h > Remaining Estimate: 504h > > Solr caches docsets at the top level Multi*Reader level. After a > commit, the filter/docset caches are flushed. Reloading the > cache in near realtime (i.e. commits every 1s - 2min) > unnecessarily consumes IO resources when reloading the filters, > especially for largish indexes. > We'll cache docsets at the SegmentReader level. The cache key > will include the reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
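As a sketch of what "adding up the counts for identical field values" could mean (my reading, not FacetComponent's actual code): merge per-segment facet counts by summing per term.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class FacetMerge {
    // Merge per-segment term->count maps by summing counts for the same term.
    static Map<String, Integer> merge(List<Map<String, Integer>> perSegment) {
      Map<String, Integer> merged = new HashMap<String, Integer>();
      for (Map<String, Integer> seg : perSegment) {
        for (Map.Entry<String, Integer> e : seg.entrySet()) {
          Integer cur = merged.get(e.getKey());
          merged.put(e.getKey(), cur == null ? e.getValue() : cur + e.getValue());
        }
      }
      return merged;
    }
  }

The side-by-side randomized test described above would then compare this merged result against the single multi-segment computation.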
[jira] Commented: (SOLR-1619) Cache documents by their internal ID
[ https://issues.apache.org/jira/browse/SOLR-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786233#action_12786233 ] Jason Rutherglen commented on SOLR-1619: Right, we'd somehow give the user either option. > Cache documents by their internal ID > > > Key: SOLR-1619 > URL: https://issues.apache.org/jira/browse/SOLR-1619 > Project: Solr > Issue Type: Improvement > Components: search >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > > Currently documents are cached by their Lucene docid, however we can instead > cache them using their schema derived unique id. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1619) Cache documents by their internal ID
Cache documents by their internal ID Key: SOLR-1619 URL: https://issues.apache.org/jira/browse/SOLR-1619 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Currently documents are cached by their Lucene docid, however we can instead cache them using their schema derived unique id. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1618) Merge docsets on segment merge
Merge docsets on segment merge -- Key: SOLR-1618 URL: https://issues.apache.org/jira/browse/SOLR-1618 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 When SOLR-1308 is implemented, we can save some time when creating new docsets by merging them in RAM as segments are merged (similar to LUCENE-1785) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1617) Cache and merge facets per segment
Cache and merge facets per segment -- Key: SOLR-1617 URL: https://issues.apache.org/jira/browse/SOLR-1617 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Spinoff from SOLR-1308. We'll enable per-segment facet caching and merging which will allow near realtime faceted searching. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1308) Cache docsets at the SegmentReader level
[ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1308: --- Description: Solr caches docsets at the top level Multi*Reader level. After a commit, the filter/docset caches are flushed. Reloading the cache in near realtime (i.e. commits every 1s - 2min) unnecessarily consumes IO resources when reloading the filters, especially for largish indexes. We'll cache docsets at the SegmentReader level. The cache key will include the reader. was: Solr caches docsets and documents at the top level Multi*Reader level. After a commit, the caches are flushed. Reloading the caches in near realtime (i.e. commits every 1s - 2min) unnecessarily consumes IO resources, especially for largish indexes. We can cache docsets and documents at the SegmentReader level. The cache settings in SolrConfig can be applied to the individual SR caches. Summary: Cache docsets at the SegmentReader level (was: Cache docsets and docs at the SegmentReader level) I changed the title because we're not going to cache docs in this issue (though I think it's possible to cache docs by the internal id, rather than the doc id). Per-segment facet caching and merging can go into a different issue. > Cache docsets at the SegmentReader level > > > Key: SOLR-1308 > URL: https://issues.apache.org/jira/browse/SOLR-1308 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Original Estimate: 504h > Remaining Estimate: 504h > > Solr caches docsets at the top level Multi*Reader level. After a > commit, the filter/docset caches are flushed. Reloading the > cache in near realtime (i.e. commits every 1s - 2min) > unnecessarily consumes IO resources when reloading the filters, > especially for largish indexes. > We'll cache docsets at the SegmentReader level. The cache key > will include the reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level
[ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785433#action_12785433 ] Jason Rutherglen commented on SOLR-1308: I realized that because of UnInvertedField, we'll need to merge facet results from UIF per reader, so using a MultiDocSet won't help. Can we leverage the distributed merging that FacetComponent implements (i.e. reuse and/or change the code to work in both the distributed and local cases)? Ah well, I was hoping for an easy solution for realtime facets. > Cache docsets and docs at the SegmentReader level > - > > Key: SOLR-1308 > URL: https://issues.apache.org/jira/browse/SOLR-1308 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Original Estimate: 504h > Remaining Estimate: 504h > > Solr caches docsets and documents at the top level Multi*Reader > level. After a commit, the caches are flushed. Reloading the > caches in near realtime (i.e. commits every 1s - 2min) > unnecessarily consumes IO resources, especially for largish > indexes. > We can cache docsets and documents at the SegmentReader level. > The cache settings in SolrConfig can be applied to the > individual SR caches. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)
[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785014#action_12785014 ] Jason Rutherglen commented on SOLR-1277: bq. The question then becomes what do you want to make automatic vs those things that require operator intervention. Right, I'd like the distributed Solr + ZK system to automatically fail over to another server if there's a functional software failure. Also, with a search system, query times are very important, and if they suddenly degrade on a replicated server, the node needs to be removed and a new server brought online (hopefully automatically). If Solr + ZK doesn't take out a server whose query times are 10 times the average of the other comparable replicated slave servers, then it's harder to justify going live with it, in my humble opinion, because it's not really solving the main reason to use a naming service. While this may not be functionality we need in an initial release, it's important to ensure our initial design does not limit future functionality. > Implement a Solr specific naming service (using Zookeeper) > -- > > Key: SOLR-1277 > URL: https://issues.apache.org/jira/browse/SOLR-1277 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 1.5 > > Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, > SOLR-1277.patch, zookeeper-3.2.1.jar > > Original Estimate: 672h > Remaining Estimate: 672h > > The goal is to give Solr server clusters self-healing attributes > where if a server fails, indexing and searching don't stop and > all of the partitions remain searchable. For configuration, the > ability to centrally deploy a new configuration without servers > going offline. > We can start with basic failover and go from there? > Features: > * Automatic failover (i.e. when a server fails, clients stop > trying to index to or search it) > * Centralized configuration management (i.e. new solrconfig.xml > or schema.xml propagates to a live Solr cluster) > * Optionally allow shards of a partition to be moved to another > server (i.e. if a server gets hot, move the hot segments out to > cooler servers). Ideally we'd have a way to detect hot segments > and move them seamlessly. With NRT this becomes somewhat more > difficult but not impossible? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)
[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784973#action_12784973 ] Jason Rutherglen commented on SOLR-1277: If we're detecting node failure, it seems Solr's functionality should also be checked for failure. The discussions thus far seem to be around network or process failure, which is usually either intermittent or terminal. Detecting measurable increases or decreases in CPU and RAM consumption, OOMs, query failures, and indexing failures due to bugs is probably more important than detecting that the network is down, because they are harder to detect and fix. How is HBase handling the detection of functional issues in relation to ZK? > Implement a Solr specific naming service (using Zookeeper) > -- > > Key: SOLR-1277 > URL: https://issues.apache.org/jira/browse/SOLR-1277 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 > Reporter: Jason Rutherglen >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 1.5 > > Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, > SOLR-1277.patch, zookeeper-3.2.1.jar > > Original Estimate: 672h > Remaining Estimate: 672h > > The goal is to give Solr server clusters self-healing attributes > where if a server fails, indexing and searching don't stop and > all of the partitions remain searchable. For configuration, the > ability to centrally deploy a new configuration without servers > going offline. > We can start with basic failover and go from there? > Features: > * Automatic failover (i.e. when a server fails, clients stop > trying to index to or search it) > * Centralized configuration management (i.e. new solrconfig.xml > or schema.xml propagates to a live Solr cluster) > * Optionally allow shards of a partition to be moved to another > server (i.e. if a server gets hot, move the hot segments out to > cooler servers). Ideally we'd have a way to detect hot segments > and move them seamlessly. With NRT this becomes somewhat more > difficult but not impossible? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level
[ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784668#action_12784668 ] Jason Rutherglen commented on SOLR-1308: I'm taking a look at this; it's straightforward to cache and reuse docsets per reader in SolrIndexSearcher. However, we're passing docsets all over the place (i.e. UnInvertedField). We can't exactly rip out DocSet without breaking most unit tests and writing a bunch of facet merging code. We'd likely lose functionality? Will the MultiDocSet concept from SOLR-568 work as an easy way to get something up and running? Then we can benchmark and see if we've lost performance? > Cache docsets and docs at the SegmentReader level > - > > Key: SOLR-1308 > URL: https://issues.apache.org/jira/browse/SOLR-1308 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Original Estimate: 504h > Remaining Estimate: 504h > > Solr caches docsets and documents at the top level Multi*Reader > level. After a commit, the caches are flushed. Reloading the > caches in near realtime (i.e. commits every 1s - 2min) > unnecessarily consumes IO resources, especially for largish > indexes. > We can cache docsets and documents at the SegmentReader level. > The cache settings in SolrConfig can be applied to the > individual SR caches. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
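For illustration, a minimal sketch of what a MultiDocSet might look like; this is my reading of the SOLR-568 concept, not code from that issue.

  import java.util.BitSet;

  class MultiDocSet {
    private final BitSet[] segSets;  // one docset per segment
    private final int[] docBases;    // starting global docid of each segment

    MultiDocSet(BitSet[] segSets, int[] docBases) {
      this.segSets = segSets;
      this.docBases = docBases;
    }

    // Present the per-segment docsets as one logical set over the
    // MultiReader's docid space.
    boolean exists(int globalDoc) {
      // linear scan from the last segment for brevity; real code would
      // binary-search the docBases array
      for (int i = segSets.length - 1; i >= 0; i--) {
        if (globalDoc >= docBases[i]) {
          return segSets[i].get(globalDoc - docBases[i]);
        }
      }
      return false;
    }
  }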
[jira] Created: (SOLR-1614) Search in Hadoop
Search in Hadoop Key: SOLR-1614 URL: https://issues.apache.org/jira/browse/SOLR-1614 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 What's the use case? Sometimes queries are expensive (such as regex), or one has indexes located in HDFS that then need to be searched. By leveraging Hadoop, these non-time-sensitive queries may be executed without dynamically deploying the indexes to new Solr servers. We'll download the indexes out of HDFS (assuming they're zipped), perform the queries in a batch on each index shard, then merge the results either using a Solr query results priority queue, or simply using Hadoop's built-in merge sorting. The query file will be encoded in JSON format (ID, query, numresults, fields). The shards file will simply contain newline-delimited paths (HDFS or otherwise). The output can be a Solr-encoded results file per query. I'm hoping to add an actual Hadoop unit test. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
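For illustration only, the two input files might look something like this; the exact JSON keys are not specified in the issue, so these are assumptions following the (ID, query, numresults, fields) description.

  queries.json (one query per line):
  {"id": "q1", "query": "title:foo AND body:bar", "numresults": 10, "fields": "id,title"}
  {"id": "q2", "query": "body:hadoop", "numresults": 100, "fields": "id"}

  shards.txt (newline-delimited index paths):
  hdfs://namenode/indexes/shard1.zip
  hdfs://namenode/indexes/shard2.zip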
[jira] Updated: (SOLR-1610) Add generics to SolrCache
[ https://issues.apache.org/jira/browse/SOLR-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1610: --- Attachment: SOLR-1610.patch Compiles, ran some of the unit tests. Not sure what else needs to be done? > Add generics to SolrCache > - > > Key: SOLR-1610 > URL: https://issues.apache.org/jira/browse/SOLR-1610 > Project: Solr > Issue Type: Improvement > Components: search >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Trivial > Fix For: 1.5 > > Attachments: SOLR-1610.patch > > > Seems fairly simple for SolrCache to have generics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
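For reference, the shape of the change is roughly this reduced sketch (the real SolrCache interface has more methods; this is not the attached patch):

  // A generified cache interface: call sites gain type safety and lose casts.
  interface SolrCache<K, V> {
    V get(K key);
    V put(K key, V value);
    int size();
  }

  // Before: Object o = cache.get(query); DocSet set = (DocSet) o;
  // After:  SolrCache<Query, DocSet> filterCache = ...;
  //         DocSet set = filterCache.get(query);   // no cast needed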
[jira] Created: (SOLR-1610) Add generics to SolrCache
Add generics to SolrCache - Key: SOLR-1610 URL: https://issues.apache.org/jira/browse/SOLR-1610 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Trivial Fix For: 1.5 Seems fairly simple for SolrCache to have generics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Entity Extraction feature
Stanford's is open source and works quite well. http://nlp.stanford.edu/software/CRF-NER.shtml On Tue, Nov 17, 2009 at 10:25 PM, Pradeep Pujari wrote: > Hello all, > > Does Lucene or Solr has entity extraction feature? If so, what is the wiki > URL? > > Thanks, > Pradeep. > >
SolrCache not using generics?
Maybe we can add generics to SolrCache, or is there a design reason not to?
[jira] Created: (SOLR-1609) Create a cache implementation that limits itself to a given RAM size
Create a cache implementation that limits itself to a given RAM size Key: SOLR-1609 URL: https://issues.apache.org/jira/browse/SOLR-1609 Project: Solr Issue Type: New Feature Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 This is a spinoff from the unrelated SOLR-1308. We can limit the cache sizes by estimated RAM usage. I think in some cases this is a better approach than using soft references, since it will effectively limit the RAM the cache uses. Soft references will utilize the max heap before divesting themselves of excessive cached items, which in some cases may not be the desired behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
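A minimal sketch of such a cache, assuming the caller supplies a per-entry size estimate; this is an illustration, not the design the issue will settle on.

  import java.util.Iterator;
  import java.util.LinkedHashMap;
  import java.util.Map;

  class RamBoundedCache<K, V> {
    interface SizeEstimator<K, V> { long bytes(K key, V value); }

    private final long maxBytes;
    private final SizeEstimator<K, V> estimator;
    private long usedBytes;
    // accessOrder=true makes iteration order least-recently-used first
    private final LinkedHashMap<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true);

    RamBoundedCache(long maxBytes, SizeEstimator<K, V> estimator) {
      this.maxBytes = maxBytes;
      this.estimator = estimator;
    }

    synchronized V get(K key) { return map.get(key); }

    synchronized void put(K key, V value) {
      V old = map.put(key, value);
      if (old != null) usedBytes -= estimator.bytes(key, old);
      usedBytes += estimator.bytes(key, value);
      // evict least-recently-used entries until we're under the RAM budget
      Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
      while (usedBytes > maxBytes && it.hasNext()) {
        Map.Entry<K, V> e = it.next();
        if (e.getKey().equals(key)) continue;  // never evict the entry just added
        usedBytes -= estimator.bytes(e.getKey(), e.getValue());
        it.remove();
      }
    }
  }

The estimator is assumed to return a stable value per entry; with that assumption, usedBytes stays an accurate running total.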
[jira] Updated: (SOLR-1606) Integrate Near Realtime
[ https://issues.apache.org/jira/browse/SOLR-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1606: --- Attachment: SOLR-1606.patch Solr config can have an index nrt (true|false), or commit can specify the nrt var. With nrt=true, when creating a new searcher we call getReader. > Integrate Near Realtime > > > Key: SOLR-1606 > URL: https://issues.apache.org/jira/browse/SOLR-1606 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 1.5 > > Attachments: SOLR-1606.patch > > > We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
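Conceptually, the searcher-opening decision the patch describes looks like this (names illustrative, assuming the Lucene 2.9 APIs):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;

  class NrtOpen {
    // nrt=true: take a near-realtime reader from the writer (pending changes
    // visible, no fsync). nrt=false: reopen from the last commit point.
    static IndexReader openReader(IndexWriter writer, IndexReader current,
                                  boolean nrt) throws IOException {
      return nrt ? writer.getReader() : current.reopen();
    }
  }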
[jira] Created: (SOLR-1606) Integrate Near Realtime
Integrate Near Realtime Key: SOLR-1606 URL: https://issues.apache.org/jira/browse/SOLR-1606 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 We'll integrate IndexWriter.getReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1578) Develop a Spatial Query Parser
[ https://issues.apache.org/jira/browse/SOLR-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780184#action_12780184 ] Jason Rutherglen commented on SOLR-1578: GBase http://code.google.com/apis/base/docs/2.0/query-lang-spec.html (Locations section at the bottom of the page) has a query syntax for spatial queries (e.g. @+40.75-074.00 + 5mi) > Develop a Spatial Query Parser > -- > > Key: SOLR-1578 > URL: https://issues.apache.org/jira/browse/SOLR-1578 > Project: Solr > Issue Type: New Feature >Reporter: Grant Ingersoll > Fix For: 1.5 > > > Given all the work around spatial, it would be beneficial if Solr had a query > parser for dealing with spatial queries. For starters, something that used > geonames data or maybe even Google Maps API would be really useful. Longer > term, a spatial grammar that can robustly handle all the vagaries of > addresses, etc. would be really cool. > Refs: > [1] http://www.geonames.org/export/client-libraries.html (note the Java > client is ASL) > [2] Data from geo names: http://download.geonames.org/export/dump/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1506) Search multiple cores using MultiReader
[ https://issues.apache.org/jira/browse/SOLR-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1506: --- Attachment: SOLR-1506.patch MultiReader doesn't support reopen with the readOnly parameter. This patch adds a test case for commit on the proxy, and a workaround (if the unsupported-operation exception is caught, the regular reopen is called). > Search multiple cores using MultiReader > --- > > Key: SOLR-1506 > URL: https://issues.apache.org/jira/browse/SOLR-1506 > Project: Solr > Issue Type: Improvement > Components: search >Affects Versions: 1.4 > Reporter: Jason Rutherglen >Priority: Trivial > Fix For: 1.5 > > Attachments: SOLR-1506.patch, SOLR-1506.patch, SOLR-1506.patch > > > I need to search over multiple cores, and SOLR-1477 is more > complicated than expected, so here we'll create a MultiReader > over the cores to allow searching on them. > Maybe in the future we can add parallel searching; however, > SOLR-1477, if it gets completed, provides that out of the box. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
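The fallback described above, sketched in Java; this assumes the Lucene 2.9 reopen APIs and that MultiReader signals the unsupported case with UnsupportedOperationException (my reading of the comment, not code from the patch).

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;

  class ReopenWorkaround {
    static IndexReader reopenCompat(IndexReader reader) throws IOException {
      try {
        return reader.reopen(true);   // readOnly reopen
      } catch (UnsupportedOperationException e) {
        return reader.reopen();       // MultiReader fallback: plain reopen
      }
    }
  }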