RE: Solr with Hadoop
I'm familiar with and have used the DSE cluster, and I'm in the process of evaluating Cloudera Search. In general, Cloudera Search has tight integration with HDFS and takes care of replication and sharding transparently by reusing the pre-existing HDFS replication and sharding. However, Cloudera Search actually uses SolrCloud underneath, so you would need to install ZooKeeper to coordinate the Solr nodes. DataStax allows you to talk to Solr, but their model scales around the data model and architecture of Cassandra; release 3.1 adds some Solr admin functionality and removes the need to write Cassandra-specific code.

If you go the open source route you have a few options:

1) Build a custom plugin inside Solr that internally queries HDFS and returns data. You would need to figure out how to scale this, potentially using a solution very similar to Cloudera Search (i.e. leverage SolrCloud), and if using SolrCloud you would need to install ZooKeeper for node coordination.
2) Create a Flume channel that accumulates specific events and a sink that writes the data directly to Solr.
3) Look at Cloudera Search if you need tight integration with Hadoop; it might save you some time and effort.

I don't think you want Solr triggering MapReduce jobs if you're looking for very fast throughput from your search service. Hope this helps; ping me offline if you have more questions. Regards

From: mlie...@impetus.com
To: solr-user@lucene.apache.org
Subject: Re: Solr with Hadoop
Date: Thu, 18 Jul 2013 15:41:36 +

Rajesh, If you require an integration between Solr and Hadoop or NoSQL, I would recommend using a commercial distribution. I think most are free to use as long as you don't require support.
I inquired about the Cloudera Search capability, but it seems that so far it is just preliminary: there is no tight integration yet between HBase and Solr, for example, other than full-text search on the HDFS data (I believe enabled in Hue). I am not too familiar with what MapR's M7 has to offer. However, DataStax does a good job of tightly integrating Solr with Cassandra, and lets you query the data ingested through Solr in Hive, for example, which is pretty nice. Solr would not trigger Hadoop jobs, though. Cheers, Matt

On 7/17/13 7:37 PM, Rajesh Jain rjai...@gmail.com wrote:

I have a newbie question on integrating Solr with Hadoop. There are some vendors like Cloudera/MapR who have announced Solr Search for Hadoop. If I use the Apache distro, how can I use Solr Search on docs in HDFS/Hadoop? Is there a tutorial on how to use it or getting started? I am using Flume to sink CSV docs into Hadoop/HDFS, and I would like to use Solr to provide search. Does Solr Search trigger MapReduce jobs (like Splunk/Hunk does)? Thanks, Rajesh
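For the Flume-sink option discussed above, the core of the idea is turning each event (here, a CSV line) into a Solr update request. A minimal stdlib-only sketch of that conversion, where the field names and core URL are hypothetical (a production setup would more likely use an existing Flume Solr sink than hand-rolled code):

```java
// Sketch of the Flume-to-Solr sink idea: each event (a CSV line) is converted
// into a JSON body for Solr's /update handler. Field names and the core name
// are hypothetical; real data would also need proper JSON escaping.
public class CsvToSolrUpdate {

    // Build the JSON payload for a single CSV line (naive: assumes no
    // embedded commas or quotes in the data).
    static String buildUpdateJson(String csvLine, String[] fields) {
        String[] values = csvLine.split(",");
        StringBuilder sb = new StringBuilder("{\"add\":{\"doc\":{");
        for (int i = 0; i < fields.length && i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(fields[i]).append("\":\"")
              .append(values[i].trim()).append('"');
        }
        sb.append("}}}");
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"id", "host", "message"}; // hypothetical schema
        String json = buildUpdateJson("42,web01,login ok", fields);
        // prints {"add":{"doc":{"id":"42","host":"web01","message":"login ok"}}}
        System.out.println(json);
        // The body would then be POSTed to e.g. http://localhost:8983/solr/logs/update
    }
}
```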
Re: preferred container for running SolrCloud
We're running under jetty. Sent from my iPhone On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote: 1) Jboss 2) Jetty 3) Tomcat 4) Other.. ?
RE: preferred container for running SolrCloud
Separate Zookeeper.

Date: Thu, 11 Jul 2013 19:27:18 -0700
Subject: Re: preferred container for running SolrCloud
From: docbook@gmail.com
To: solr-user@lucene.apache.org

With the embedded Zookeeper or a separate Zookeeper? Also, have you run into any issues with running SolrCloud on jetty?

On Thu, Jul 11, 2013 at 7:01 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: We're running under jetty. Sent from my iPhone

On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote: 1) Jboss 2) Jetty 3) Tomcat 4) Other.. ?
RE: preferred container for running SolrCloud
One last thing: no issues with jetty. The issues we did have were actually with running separate zookeeper clusters.

From: sxk1...@hotmail.com
To: solr-user@lucene.apache.org
Subject: RE: preferred container for running SolrCloud
Date: Thu, 11 Jul 2013 20:13:27 -0700

Separate Zookeeper.

Date: Thu, 11 Jul 2013 19:27:18 -0700
Subject: Re: preferred container for running SolrCloud
From: docbook@gmail.com
To: solr-user@lucene.apache.org

With the embedded Zookeeper or a separate Zookeeper? Also, have you run into any issues with running SolrCloud on jetty?

On Thu, Jul 11, 2013 at 7:01 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: We're running under jetty. Sent from my iPhone

On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote: 1) Jboss 2) Jetty 3) Tomcat 4) Other.. ?
RE: Content based recommender using lucene/solr
Why not just use Mahout to do this? There is an item similarity algorithm in Mahout that does exactly this :) https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html You can use Mahout in distributed and non-distributed mode as well.

From: lcguerreroc...@gmail.com
Date: Fri, 28 Jun 2013 12:16:57 -0500
Subject: Content based recommender using lucene/solr
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org

Hi, I'm using Lucene and Solr right now in a production environment with an index of about a million docs. I'm working on a recommender that would list the n most similar items to the user based on the item he is currently viewing. I've been thinking of using Solr/Lucene since I already have all the docs available and I want a quick version that can be deployed while we work on a more robust recommender. How about overriding the default similarity so that it scores documents based on the euclidean distance of normalized item attributes, and then using a MoreLikeThis component to pass in the attributes of the item for which I want to generate recommendations? I know this has issues, like recomputing scores/normalization/weight application at query time, which could make the idea unfeasible/impractical. I'm at a very preliminary stage right now and would love some suggestions from experienced users. Thank you, Luis Guerrero
RE: Content based recommender using lucene/solr
You could build a custom recommender in Mahout to accomplish this. Also, just out of curiosity, why the content-based approach as opposed to building a recommender based on co-occurrence? One other thing: what is your data size, and are you looking at a scale where you need something like Hadoop?

From: lcguerreroc...@gmail.com
Date: Fri, 28 Jun 2013 13:02:00 -0500
Subject: Re: Content based recommender using lucene/solr
To: solr-user@lucene.apache.org
CC: java-u...@lucene.apache.org

Hey Saikat, thanks for your suggestion. I've looked into Mahout and other alternatives for computing k nearest neighbors. I would have to run a job to compute the k nearest neighbors and track them in the index for retrieval. I wanted to see if this was something I could do in Lucene using Lucene's scoring function and Solr's MoreLikeThis component. The job you mention is for item-based recommendation, which would require me to track the different items users have viewed. I'm looking for a content-based approach where I would use a distance measure to establish how near (how similar) items are, and have some kind of training phase to adjust weights.

On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Why not just use Mahout to do this? There is an item similarity algorithm in Mahout that does exactly this :) https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html You can use Mahout in distributed and non-distributed mode as well.

-- Luis Carlos Guerrero Covo, M.S. Computer Engineering, (57) 3183542047
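The distance measure Luis describes is simple to state in code; a minimal sketch of weighted euclidean distance over attributes normalized to [0, 1], where the weights and attribute vectors are hypothetical placeholders (inside Solr this logic would live in a custom Similarity or function query, not standalone code):

```java
// Sketch of the content-based scoring idea from the thread: items are
// compared by weighted euclidean distance over normalized attributes. The
// weights stand in for the "training phase" mentioned above.
public class ContentDistance {

    // Weighted euclidean distance between two attribute vectors.
    static double distance(double[] a, double[] b, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += weights[i] * d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 0.5, 2.0};     // hypothetical learned weights
        double[] current = {0.2, 0.9, 0.5};     // item the user is viewing
        double[][] candidates = {{0.2, 0.8, 0.5}, {0.9, 0.1, 0.0}};
        // Smaller distance = more similar; the first candidate wins here.
        for (double[] c : candidates) {
            System.out.println(distance(current, c, weights));
        }
    }
}
```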
RE: Creating a new core programmicatically in solr
I'm aware of the CoreAdminRequest API; however, given that our solr cluster machines have their own internal configurations, I'd prefer to use the http approach rather than having to specify the instanceDir or the solrServer. One issue I was thinking of was the double quotes needed around the curl command and how to simulate that through the restclient or even the java code; it's weird that this is needed. Anyway, thanks for the input.

Date: Tue, 4 Jun 2013 09:52:03 -0700
From: bbar...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Creating a new core programmicatically in solr

I would use the method below to create a new core on the fly:

CoreAdminResponse e = CoreAdminRequest.createCore(name, instanceDir, server);

http://lucene.apache.org/solr/4_3_0/solr-solrj/org/apache/solr/client/solrj/response/CoreAdminResponse.html

-- View this message in context: http://lucene.472066.n3.nabble.com/Creating-a-new-core-programmicatically-in-solr-tp4068132p4068134.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Creating a new core programmicatically in solr
I need to simulate this curl command line with java code:

curl "http://10.42.6.74:8983/solr/admin/cores?action=CREATE&name=NEW_SCHEMA.solr"

Obviously doing a simple HttpGet with the appropriate query parameters is not the answer. I don't believe your example is going to work, because I am already passing a string into the constructor for the HttpGet class (namely getSolrClient().getBaseUrl() + ADMIN_CORE_CONSTRUCT + "?action=" + action + "&name=" + name); adding quotes around it will confuse the compiler. Let me know if I missed something here.

Date: Tue, 4 Jun 2013 10:03:50 -0700
From: bbar...@gmail.com
To: solr-user@lucene.apache.org
Subject: RE: Creating a new core programmicatically in solr

Did you try escaping double quotes when you are making the http request?

HttpGet req = new HttpGet("\"" + getSolrClient().getBaseUrl() + ADMIN_CORE_CONSTRUCT + "?action=" + action + "&name=" + name + "\"");
HttpResponse response = client.execute(request);

-- View this message in context: http://lucene.472066.n3.nabble.com/Creating-a-new-core-programmicatically-in-solr-tp4068132p4068139.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Creating a new core programmicatically in solr
Thanks, good catch; completely forgot about the & and its meaning in unix.

From: j...@basetechnology.com
To: solr-user@lucene.apache.org
Subject: Re: Creating a new core programmicatically in solr
Date: Tue, 4 Jun 2013 13:22:34 -0400

The double quotes are required for curl simply because of the &, which tells the shell to run the preceding command in the background. The quotes around the full URL escape the &. -- Jack Krupansky

-----Original Message----- From: Saikat Kanjilal Sent: Tuesday, June 04, 2013 12:56 PM To: solr-user@lucene.apache.org Subject: RE: Creating a new core programmicatically in solr

I'm aware of the CoreAdminRequest API; however, given that our solr cluster machines have their own internal configurations, I'd prefer to use the http approach rather than having to specify the instanceDir or the solrServer. One issue I was thinking of was the double quotes needed around the curl command and how to simulate that through the restclient or even the java code; it's weird that this is needed. Anyway, thanks for the input.

Date: Tue, 4 Jun 2013 09:52:03 -0700 From: bbar...@gmail.com To: solr-user@lucene.apache.org Subject: Re: Creating a new core programmicatically in solr

I would use the method below to create a new core on the fly:

CoreAdminResponse e = CoreAdminRequest.createCore(name, instanceDir, server);

http://lucene.apache.org/solr/4_3_0/solr-solrj/org/apache/solr/client/solrj/response/CoreAdminResponse.html
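As the thread concludes, the quotes are shell syntax only; in Java you just build the URL (encoding the parameter values) and issue the GET, no quoting needed. A minimal stdlib-only sketch, where the host and core name are the hypothetical values from the thread:

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// The double quotes in the curl command exist only to stop the shell from
// interpreting '&' as "run in background". In Java, encode each parameter
// value and concatenate; no surrounding quotes are required.
public class CreateCore {

    static String buildCreateUrl(String baseUrl, String coreName) throws IOException {
        return baseUrl + "/admin/cores?action=CREATE&name="
                + URLEncoder.encode(coreName, StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        String url = buildCreateUrl("http://10.42.6.74:8983/solr", "NEW_SCHEMA");
        // prints http://10.42.6.74:8983/solr/admin/cores?action=CREATE&name=NEW_SCHEMA
        System.out.println(url);
        // To actually create the core (requires a running Solr), open the URL
        // with any HTTP client, e.g. java.net.HttpURLConnection, and check the
        // response code.
    }
}
```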
RE: Keeping a rolling window of indexes around solr
At first glance, unless I missed something, Hourglass will definitely not work for our use-case, which involves only real-time inserts of new log data and no appends at all. However, I would like to examine the guts of Hourglass to see if we can customize it for our use-case.

From: arafa...@gmail.com
Date: Mon, 27 May 2013 16:17:12 -0400
Subject: Re: Keeping a rolling window of indexes around solr
To: solr-user@lucene.apache.org

But how is Hourglass going to help Solr? Or is it a portable implementation? Regards, Alex.
Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi, SolrCloud now has the same index aliasing as Elasticsearch. I can't look up the link now, but Zoie from LinkedIn has Hourglass, which it uses for a circular-buffer sort of index setup, if I recall correctly. Otis
Solr ElasticSearch Support http://sematext.com/

On May 24, 2013 10:26 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

Hello Solr community folks, I am doing some investigative work around how to roll and manage indexes inside our solr configuration. To date I've come up with an architecture that separates a set of masters focused on writes, replicated periodically, from a set of slave shards focused strictly on reads; additionally, for each master index the design includes partial purges performed on each of the slave shards as well as the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a lot of moving pieces. I was wondering if anyone has ever handled/designed an architecture around a conveyor belt or rolling window of indexes covering n days of data, and if there are best practices around this. One thing I was thinking about was whether to keep a conveyor-belt list of the slave shards and rotate them as needed, and drop the master periodically, making its backup temporarily the master. Anyway, I would love to hear thoughts and similar use-cases from the community. Regards
Re: Keeping a rolling window of indexes around solr
Volume of data: one log insert every 30 seconds; queries done sporadically and asynchronously at a much lower frequency, every few days. Also, the majority of the requests are indeed going to be within a splice of time (typically hours or at most a few days).

Types of queries:
1) Keyword or term search
2) Search by guid (or id, as known in the solr world)
3) Reserved or percolation queries to be executed when new data becomes available
4) Search by dates as mentioned above

Regards. Sent from my iPhone

On May 28, 2013, at 4:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: This is kind of the approach used by elastic search, if I'm not using
: solrcloud will I be able to use shard aliasing, also with this approach
: how would replication work, is it even needed?

You haven't said much about the volume of data you expect to deal with, nor have you really explained what types of queries you intend to do -- i.e. you said you were interested in a "rolling window of indexes around n days of data" but you never clarified why you think a rolling window of indexes would be useful to you or how exactly you would use it.

The primary advantage of sharding by date is if you know that a large percentage of your queries are only going to be within a small range of time, and therefore you can optimize those requests to only hit the shards necessary to satisfy that small window of time. If the majority of requests are going to be across your entire n days of data, then date-based sharding doesn't really help you -- you can just use arbitrary (randomized) sharding, with periodic deleteByQuery commands to purge anything older than N days. Query the whole collection by default, and add a filter query if/when you want to restrict your search to only a narrow date range of documents.

This is the same general approach you would use on a non-distributed / non-SolrCloud setup if you just had a single collection on a single master replicated to some number of slaves for horizontal scaling. -Hoss
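The purge-plus-filter approach Hoss describes comes down to two Solr query strings built with date math. A minimal sketch; the field name `timestamp_dt` and the window sizes are hypothetical:

```java
// Hoss's suggestion in code form: instead of rolling shards, periodically
// issue a deleteByQuery that purges documents older than N days, and restrict
// searches with a date-range filter query (fq). Field name is hypothetical.
public class RollingWindowQueries {

    // deleteByQuery body that purges everything older than nDays
    static String purgeQuery(String dateField, int nDays) {
        return dateField + ":[* TO NOW-" + nDays + "DAYS]";
    }

    // fq parameter restricting a search to the last nDays
    static String windowFilter(String dateField, int nDays) {
        return dateField + ":[NOW-" + nDays + "DAYS TO NOW]";
    }

    public static void main(String[] args) {
        System.out.println(purgeQuery("timestamp_dt", 7));   // timestamp_dt:[* TO NOW-7DAYS]
        System.out.println(windowFilter("timestamp_dt", 2)); // timestamp_dt:[NOW-2DAYS TO NOW]
    }
}
```

The purge query would run on a schedule (cron or similar); the filter query is attached per-request only when the caller wants a narrow date range.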
Keeping a rolling window of indexes around solr
Hello Solr community folks, I am doing some investigative work around how to roll and manage indexes inside our solr configuration. To date I've come up with an architecture that separates a set of masters focused on writes, replicated periodically, from a set of slave shards focused strictly on reads; additionally, for each master index the design includes partial purges performed on each of the slave shards as well as the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a lot of moving pieces. I was wondering if anyone has ever handled/designed an architecture around a conveyor belt or rolling window of indexes covering n days of data, and if there are best practices around this. One thing I was thinking about was whether to keep a conveyor-belt list of the slave shards and rotate them as needed, and drop the master periodically, making its backup temporarily the master. Anyway, I would love to hear thoughts and similar use-cases from the community. Regards
RE: Keeping a rolling window of indexes around solr
I would like to see something similar to this existing in the solr world, or I could gladly help create it: https://github.com/karussell/elasticsearch-rollindex. We are evaluating both elasticsearch and our current solr architecture and need to manage write-heavy use-cases within a rolling window.

Date: Fri, 24 May 2013 09:07:38 -0600
From: elyog...@elyograg.org
To: solr-user@lucene.apache.org
Subject: Re: Keeping a rolling window of indexes around solr

On 5/24/2013 8:56 AM, Shawn Heisey wrote:

On 5/24/2013 8:25 AM, Saikat Kanjilal wrote: Anyways would love to hear thoughts and usecases that are similar from the community.

Your use-case sounds a lot like what Loggly was doing back in 2010. http://loggly.com/videos/lucene-revolution-2010/

While I was writing that, I accidentally pressed the key combination that told my mail client to send the message before I was done. Loggly created a new shard every five minutes and merged older shards into longer time intervals. I personally don't need this capability, but it is a useful pattern. I was wondering recently whether a custom document router could be built for SolrCloud that automatically manages time-divided shards: creating, merging, and, if you're not keeping the data forever, deleting. Thanks, Shawn
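The time-divided routing Shawn describes starts from one deterministic mapping: document timestamp to bucket name. A minimal sketch using the five-minute buckets from the Loggly description; the `logs_` naming scheme is hypothetical:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of a time-divided document router: each document's timestamp maps
// deterministically to a shard/collection name by rounding down to a
// five-minute boundary (as in the Loggly setup). Naming is hypothetical;
// a real SolrCloud router would also create/merge/delete the shards.
public class TimeShardRouter {

    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmm").withZone(ZoneOffset.UTC);

    // Round the timestamp down to a 5-minute boundary and format a shard name.
    static String shardFor(Instant ts) {
        long bucketSeconds = 5 * 60;
        long rounded = (ts.getEpochSecond() / bucketSeconds) * bucketSeconds;
        return "logs_" + FMT.format(Instant.ofEpochSecond(rounded));
    }

    public static void main(String[] args) {
        // prints logs_20130524_1005 (10:07:38 rounds down to the 10:05 bucket)
        System.out.println(shardFor(Instant.parse("2013-05-24T10:07:38Z")));
    }
}
```

Merging older buckets into coarser intervals (hourly, daily) would reuse the same rounding with a larger `bucketSeconds`.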
Re: Keeping a rolling window of indexes around solr
This is kind of the approach used by elastic search. If I'm not using solrcloud, will I be able to use shard aliasing? Also, with this approach, how would replication work; is it even needed? Sent from my iPhone

On May 24, 2013, at 12:00 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Would collection aliasing help here? From the Solr 4.2 release notes: "Collection Aliasing. Got time based data? Want to re-index in a temporary collection and then swap it into production? Done. Stay tuned for Shard Aliasing." Regards, Alex.
Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch

On Fri, May 24, 2013 at 10:25 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

Hello Solr community folks, I am doing some investigative work around how to roll and manage indexes inside our solr configuration. To date I've come up with an architecture that separates a set of masters focused on writes, replicated periodically, from a set of slave shards focused strictly on reads; additionally, for each master index the design includes partial purges performed on each of the slave shards as well as the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a lot of moving pieces. I was wondering if anyone has ever handled/designed an architecture around a conveyor belt or rolling window of indexes covering n days of data, and if there are best practices around this. One thing I was thinking about was whether to keep a conveyor-belt list of the slave shards and rotate them as needed, and drop the master periodically, making its backup temporarily the master. Anyway, I would love to hear thoughts and similar use-cases from the community. Regards