RE: Solr with Hadoop

2013-07-18 Thread Saikat Kanjilal
I'm familiar with and have used the DSE cluster, and I am in the process of 
evaluating Cloudera Search. In general, Cloudera Search has tight integration 
with HDFS and takes care of replication and sharding transparently by using the 
pre-existing HDFS replication and sharding. However, Cloudera Search actually 
uses SolrCloud underneath, so you would need to install ZooKeeper to enable 
coordination between the Solr nodes.   DataStax allows you to talk to Solr, but 
their model scales around the data model and architecture of Cassandra; release 
3.1 adds some additional Solr admin functionality and removes the need to write 
Cassandra-specific code.

If you go the open-source route, you have a few options:

1) You can build a custom plugin inside Solr that internally queries HDFS and 
returns data. You would need to figure out how to scale this, potentially using 
a solution very similar to Cloudera Search (i.e. leveraging SolrCloud), and if 
you use SolrCloud you would need to install ZooKeeper for node coordination 
(see the component skeleton after this list)

2) You could create a Flume channel that accumulates specific events from 
HDFS and a sink that writes the data directly to Solr (see the indexing sketch 
after this list)

3) I would look at Cloudera Search if you need tight integration with Hadoop; 
it might save you some time and effort

I don't think you want Solr to trigger MapReduce jobs if you're looking for 
very fast throughput from your search service.


Hope this helps, ping me offline if you have more questions.
Regards

 From: mlie...@impetus.com
 To: solr-user@lucene.apache.org
 Subject: Re: Solr with Hadoop
 Date: Thu, 18 Jul 2013 15:41:36 +
 
 Rajesh,
 
 If you require integration between Solr and Hadoop or NoSQL, I
 would recommend using a commercial distribution. I think most are free to
 use as long as you don't require support.
 I inquired about the Cloudera Search capability, but it seems that so
 far it is just preliminary: there is no tight integration yet between
 HBase and Solr, for example, other than full-text search on the HDFS data
 (I believe enabled in Hue). I am not too familiar with what MapR's M7 has
 to offer.
 However, DataStax does a good job of tightly integrating Solr with
 Cassandra, and lets you query the data ingested from Solr in Hive, for
 example, which is pretty nice. Solr would not trigger Hadoop jobs, though.
 
 Cheers,
 Matt
 
 
 On 7/17/13 7:37 PM, Rajesh Jain rjai...@gmail.com wrote:
 
  I have a newbie question on integrating Solr with Hadoop.
 
 There are some vendors like Cloudera/MapR who have announced Solr Search
 for Hadoop.
 
  If I use the Apache distro, how can I use Solr Search on docs in
  HDFS/Hadoop?
 
  Is there a tutorial on how to use it or get started?
 
 I am using Flume to sink CSV docs into Hadoop/HDFS and I would like to use
 Solr to provide Search.
 
  Does Solr Search trigger MapReduce jobs (like Splunk-Hunk does)?
 
 Thanks,
 Rajesh
 

Re: preferred container for running SolrCloud

2013-07-11 Thread Saikat Kanjilal
We're running under jetty.

Sent from my iPhone

On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote:

 1) Jboss
 2) Jetty
 3) Tomcat
 4) Other..
 
 ?


RE: preferred container for running SolrCloud

2013-07-11 Thread Saikat Kanjilal
Separate Zookeeper.

 Date: Thu, 11 Jul 2013 19:27:18 -0700
 Subject: Re: preferred container for running SolrCloud
 From: docbook@gmail.com
 To: solr-user@lucene.apache.org
 
 With the embedded ZooKeeper or a separate ZooKeeper? Also, have you run into any
 issues with running SolrCloud on Jetty?
 
 
 On Thu, Jul 11, 2013 at 7:01 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
  We're running under jetty.
 
  Sent from my iPhone
 
  On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote:
 
   1) Jboss
   2) Jetty
   3) Tomcat
   4) Other..
  
   ?
 
  

RE: preferred container for running SolrCloud

2013-07-11 Thread Saikat Kanjilal
One last thing: no issues with Jetty.  The issues we did have were actually with 
running separate ZooKeeper clusters.
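
For reference, pointing a Solr 4.x node running under Jetty at an external
ZooKeeper ensemble is normally just a system property at startup; a minimal
sketch, where the ZooKeeper host names are placeholders:

java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar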

 From: sxk1...@hotmail.com
 To: solr-user@lucene.apache.org
 Subject: RE: preferred container for running SolrCloud
 Date: Thu, 11 Jul 2013 20:13:27 -0700
 
 Separate Zookeeper.
 
  Date: Thu, 11 Jul 2013 19:27:18 -0700
  Subject: Re: preferred container for running SolrCloud
  From: docbook@gmail.com
  To: solr-user@lucene.apache.org
  
  With the embedded ZooKeeper or a separate ZooKeeper? Also, have you run into any
  issues with running SolrCloud on Jetty?
  
  
  On Thu, Jul 11, 2013 at 7:01 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
  
   We're running under jetty.
  
   Sent from my iPhone
  
   On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote:
  
1) Jboss
2) Jetty
3) Tomcat
4) Other..
   
?
  
 
  

RE: Content based recommender using lucene/solr

2013-06-28 Thread Saikat Kanjilal
Why not just use Mahout for this? There is an item similarity algorithm in 
Mahout that does exactly this :)

https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

You can use Mahout in distributed or non-distributed mode as well.
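
For the non-distributed case, here is a minimal sketch using Mahout's Taste API;
the preferences file, the item id, and the result count are assumptions for
illustration:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    // prefs.csv is an assumed userID,itemID[,preference] file.
    DataModel model = new FileDataModel(new File("prefs.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // The 10 items most similar to item 42.
    List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}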

 From: lcguerreroc...@gmail.com
 Date: Fri, 28 Jun 2013 12:16:57 -0500
 Subject: Content based recommender using lucene/solr
 To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
 
 Hi,
 
 I'm using lucene and solr right now in a production environment with an
 index of about a million docs. I'm working on a recommender that basically
 would list the n most similar items to the user based on the current item
 he is viewing.
 
 I've been thinking of using Solr/Lucene since I already have all docs
 available and I want a quick version that can be deployed while we work on
 a more robust recommender. How about overriding the default similarity so
 that it scores documents based on the Euclidean distance of normalized item
 attributes and then using a MoreLikeThis component to pass in the
 attributes of the item for which I want to generate recommendations? I know
 it has its issues, like recomputing scores/normalization/weight application
 at query time, which could make this idea unfeasible/impractical. I'm at a
 very preliminary stage right now with this and would love some suggestions
 from experienced users.
 
 thank you,
 
 Luis Guerrero
  
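For the MoreLikeThis route described above, a first experiment can be as simple
as a query against a MoreLikeThis handler, assuming one is registered at /mlt
in solrconfig.xml; the core name and field names here are assumptions:

http://localhost:8983/solr/collection1/mlt?q=id:12345&mlt.fl=attributes_t&mlt.mintf=1&mlt.mindf=1&rows=10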

RE: Content based recommender using lucene/solr

2013-06-28 Thread Saikat Kanjilal
You could build a custom recommender in Mahout to accomplish this. Also, just 
out of curiosity, why the content-based approach as opposed to building a 
recommender based on co-occurrence? One other thing: what is your data size? 
Are you looking at a scale where you need something like Hadoop?

 From: lcguerreroc...@gmail.com
 Date: Fri, 28 Jun 2013 13:02:00 -0500
 Subject: Re: Content based recommender using lucene/solr
 To: solr-user@lucene.apache.org
 CC: java-u...@lucene.apache.org
 
 Hey Saikat, thanks for your suggestion. I've looked into Mahout and other
 alternatives for computing k nearest neighbors. I would have to run a job
 to compute the k nearest neighbors and track them in the index for
 retrieval. I wanted to see if this was something I could do with Lucene,
 using Lucene's scoring function and Solr's MoreLikeThis component. The job
 you specifically mention is for item-based recommendation, which would
 require me to track the different items users have viewed. I'm looking for
 a content-based approach where I would use a distance measure to establish
 how near (how similar) items are, and have some kind of training phase to
 adjust weights.
 
 
 On Fri, Jun 28, 2013 at 12:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
  Why not just use mahout to do this, there is an item similarity algorithm
  in mahout that does exactly this :)
 
 
  https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
 
  You can use mahout in distributed and non-distributed mode as well.
 
   From: lcguerreroc...@gmail.com
   Date: Fri, 28 Jun 2013 12:16:57 -0500
   Subject: Content based recommender using lucene/solr
   To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
  
   Hi,
  
   I'm using lucene and solr right now in a production environment with an
   index of about a million docs. I'm working on a recommender that
  basically
   would list the n most similar items to the user based on the current item
   he is viewing.
  
   I've been thinking of using solr/lucene since I already have all docs
   available and I want a quick version that can be deployed while we work
  on
   a more robust recommender. How about overriding the default similarity so
   that it scores documents based on the euclidean distance of normalized
  item
   attributes and then using a morelikethis component to pass in the
   attributes of the item for which I want to generate recommendations? I
  know
   it has its issues like recomputing scores/normalization/weight
  application
   at query time which could make this idea unfeasible/impractical. I'm at a
   very preliminary stage right now with this and would love some
  suggestions
   from experienced users.
  
   thank you,
  
   Luis Guerrero
 
 
 
 
 
 -- 
 Luis Carlos Guerrero Covo
 M.S. Computer Engineering
 (57) 3183542047
  

RE: Creating a new core programmatically in solr

2013-06-04 Thread Saikat Kanjilal
I'm aware of the CoreAdminRequest API; however, given that our Solr cluster 
machines have their own internal configurations, I'd prefer to use the HTTP 
approach rather than having to specify the instanceDir or the SolrServer.  
One issue I was thinking about was the double quotes needed around the curl 
command and how to simulate that through the REST client or even the Java code; 
it's weird that this is needed.  Anyway, thanks for the input.

 Date: Tue, 4 Jun 2013 09:52:03 -0700
 From: bbar...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Creating a new core programmatically in solr
 
 I would use the method below to create a new core on the fly...
 
 CoreAdminResponse e = CoreAdminRequest.createCore(name, instanceDir,
 server);
 
 http://lucene.apache.org/solr/4_3_0/solr-solrj/org/apache/solr/client/solrj/response/CoreAdminResponse.html
 
 
 
 
  

RE: Creating a new core programmatically in solr

2013-06-04 Thread Saikat Kanjilal
I need to simulate this curl command line with Java code:

curl "http://10.42.6.74:8983/solr/admin/cores?action=CREATE&name=NEW_SCHEMA.solr"

Obviously doing a simple HttpGet with the appropriate query parameters is not 
the answer. I don't believe your example is going to work, because I am 
passing a string into the constructor for the HttpGet class (namely 
getSolrClient().getBaseUrl()+ADMIN_CORE_CONSTRUCT+"?action="+action+"&name="+name);
 adding quotes around it will confuse the compiler.   Let me know if I missed 
something here.
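
For what it's worth, a minimal sketch of issuing the same CREATE call from Java
with Apache HttpClient 4.x; the host and core name are taken from the curl line
above, and the response handling is an assumption. No quoting is needed here,
because the & is only special to a shell:

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class CreateCoreSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = new DefaultHttpClient();
    // The & goes into the URL string as-is; only a shell needs it quoted.
    HttpGet req = new HttpGet(
        "http://10.42.6.74:8983/solr/admin/cores?action=CREATE&name=NEW_SCHEMA.solr");
    HttpResponse response = client.execute(req);
    System.out.println(response.getStatusLine());
  }
}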

 Date: Tue, 4 Jun 2013 10:03:50 -0700
 From: bbar...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: RE: Creating a new core programmatically in solr
 
 Did you try escaping double quotes when you are making the HTTP request?
 
 HttpGet req = new
 HttpGet("\""+getSolrClient().getBaseUrl()+ADMIN_CORE_CONSTRUCT+"?action="+action+"&name="+name+"\"");
  
 
 HttpResponse response = client.execute(req);
 
 
 
  

RE: Creating a new core programmatically in solr

2013-06-04 Thread Saikat Kanjilal
Thanks, good catch; I completely forgot about the & and its meaning in Unix.

 From: j...@basetechnology.com
 To: solr-user@lucene.apache.org
 Subject: Re: Creating a new core programmatically in solr
 Date: Tue, 4 Jun 2013 13:22:34 -0400
 
 The double quotes are required for curl simply because of the &, which 
 tells the shell to run the preceding command in the background. The quotes 
 around the full URL escape the &.
 
 -- Jack Krupansky
 
 -Original Message- 
 From: Saikat Kanjilal
 Sent: Tuesday, June 04, 2013 12:56 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Creating a new core programmatically in solr
 
 I'm aware of the CoreAdminRequest API, however given the fact that our solr 
 cluster machines have their own internal configurations I'd prefer to use 
 the http approach rather then having to specify the instanceDir or the 
 solrServer.One issue I was thinking of was the double quotes needed 
 around the curl command and how to simulate that through the restclient or 
 even the java code, its weird that this is needed.  Anyways thanks for the 
 inputs.
 
  Date: Tue, 4 Jun 2013 09:52:03 -0700
  From: bbar...@gmail.com
  To: solr-user@lucene.apache.org
  Subject: Re: Creating a new core programmatically in solr
 
  I would use the below method to create new core on the fly...
 
  CoreAdminResponse e = CoreAdminRequest.createCore(name, instanceDir,
  server);
 
  http://lucene.apache.org/solr/4_3_0/solr-solrj/org/apache/solr/client/solrj/response/CoreAdminResponse.html
 
 
 
 
  
 
  

RE: Keeping a rolling window of indexes around solr

2013-05-28 Thread Saikat Kanjilal
At first glance, unless I missed something, Hourglass will definitely not work 
for our use-case, which involves only real-time inserts of new log data and no 
appends at all.  However, I would like to examine the guts of Hourglass to see 
if we can customize it for our use-case.

 From: arafa...@gmail.com
 Date: Mon, 27 May 2013 16:17:12 -0400
 Subject: Re: Keeping a rolling window of indexes around solr
 To: solr-user@lucene.apache.org
 
 But how is Hourglass going to help Solr? Or is it a portable implementation?
 
 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
 
 
 On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  SolrCloud now has the same index aliasing as Elasticsearch.  I can't look up
  the link now, but Zoie from LinkedIn has Hourglass, which it uses for a
  circular-buffer sort of index setup, if I recall correctly.
 
  Otis
  Solr  ElasticSearch Support
  http://sematext.com/
  On May 24, 2013 10:26 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
  Hello Solr community folks,
  I am doing some investigative work around how to roll and manage indexes
  inside our solr configuration, to date I've come up with an architecture
  that separates a set of masters that are focused on writes and get
   replicated periodically and a set of slave shards strictly focused on
  reads, additionally for each master index the design contains partial
  purges which get performed on each of the slave shards as well as the
  master to keep the data current.   However the architecture seems a bit
  more complex than I'd like with a lot of moving pieces.  I was wondering if
  anyone has ever handled/designed an architecture around a conveyor belt
  or rolling window of indexes around n days of data and if there are best
  practices around this.  One thing I was thinking about was whether to keep
  a conveyor belt list of the slave shards and rotate them as needed and drop
  the master periodically and make its backup temporarily the master.
 
 
  Anyways would love to hear thoughts and usecases that are similar from the
  community.
 
  Regards
  

Re: Keeping a rolling window of indexes around solr

2013-05-28 Thread Saikat Kanjilal
Volume of data:
One log insert every 30 seconds; queries are run sporadically and asynchronously, 
at a much lower frequency (every few days).

Also, the majority of the requests are indeed going to be within a slice of 
time (typically hours, or at most a few days).

Type of queries:
Keyword or term search
Search by GUID (or id, as it's known in the Solr world)
Reserved or percolation queries to be executed when new data becomes available 
Search by date, as mentioned above

Regards


Sent from my iPhone

On May 28, 2013, at 4:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 : This is kind of the approach used by elastic search , if I'm not using 
 : solrcloud will I be able to use shard aliasing, also with this approach 
 : how would replication work, is it even needed?
 
 you haven't said much about the volume of data you expect to deal with, 
 nor have you really explained what types of queries you intend to do -- 
 ie: you said you were interested in a rolling window of indexes
 around n days of data but you never clarified why you think a 
 rolling window of indexes would be useful to you or how exactly you would 
 use it.
 
 The primary advantage of sharding by date is if you know that a large 
 percentage of your queries are only going to be within a small range of 
 time, and therefore you can optimize those requests to only hit the shards 
 necessary to satisfy that small window of time.
 
 if the majority of requests are going to be across your entire n days of 
 data, then date based sharding doesn't really help you -- you can just use 
 arbitrary (randomized) sharding, using periodic deleteByQuery commands to 
 purge anything older than N days.  Query the whole collection by default, 
 and add a filter query if/when you want to restrict your search to only a 
 narrow date range of documents.
 
 this is the same general approach you would use on a non-distributed / 
 non-SolrCloud setup if you just had a single collection on a single master 
 replicated to some number of slaves for horizontal scaling.
 
 
 -Hoss
 
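
A minimal SolrJ sketch of the purge-plus-filter approach described above; the
collection URL, the field name timestamp_dt, and the window sizes are
assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class RollingWindowSketch {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/logs");

    // Periodic purge: drop everything older than N days (here, 7).
    solr.deleteByQuery("timestamp_dt:[* TO NOW/DAY-7DAYS]");
    solr.commit();

    // Query the whole collection by default, filtering to a narrow date
    // range only when the request calls for it.
    SolrQuery q = new SolrQuery("some keyword");
    q.addFilterQuery("timestamp_dt:[NOW/DAY-2DAYS TO NOW]");
    System.out.println(solr.query(q).getResults().getNumFound());
  }
}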


Keeping a rolling window of indexes around solr

2013-05-24 Thread Saikat Kanjilal
Hello Solr community folks,
I am doing some investigative work around how to roll and manage indexes inside 
our Solr configuration. To date I've come up with an architecture that separates 
a set of masters focused on writes, which get replicated periodically, from a 
set of slave shards strictly focused on reads; additionally, for each master 
index the design includes partial purges, performed on each of the slave shards 
as well as the master, to keep the data current.   However, the architecture 
seems a bit more complex than I'd like, with a lot of moving pieces.  I was 
wondering if anyone has ever handled/designed an architecture around a conveyor 
belt or rolling window of indexes over n days of data, and whether there are 
best practices around this.  One thing I was thinking about was whether to keep 
a conveyor-belt list of the slave shards and rotate them as needed, and drop the 
master periodically and make its backup temporarily the master.


Anyway, I would love to hear thoughts and similar use-cases from the community.

Regards   

RE: Keeping a rolling window of indexes around solr

2013-05-24 Thread Saikat Kanjilal
I would like to see something similar to this exist in the Solr world, or I 
could gladly help create it:

https://github.com/karussell/elasticsearch-rollindex


We are evaluating both Elasticsearch and our current Solr architecture, and need 
to manage write-heavy use-cases within a rolling window.
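
On the Solr side, collection aliasing (available since Solr 4.2) can be driven
over plain HTTP, which is one building block for a rolling window; a minimal
sketch, where the alias and collection names are assumptions:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RollAliasSketch {
  public static void main(String[] args) throws Exception {
    // Point the "logs" alias at the newest time-based collection.
    URL url = new URL("http://localhost:8983/solr/admin/collections"
        + "?action=CREATEALIAS&name=logs&collections=logs_2013_05_24");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    System.out.println("HTTP " + conn.getResponseCode());
    InputStream in = conn.getInputStream(); // response body (XML by default)
    in.close();
    conn.disconnect();
  }
}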

 Date: Fri, 24 May 2013 09:07:38 -0600
 From: elyog...@elyograg.org
 To: solr-user@lucene.apache.org
 Subject: Re: Keeping a rolling window of indexes around solr
 
 On 5/24/2013 8:56 AM, Shawn Heisey wrote:
  On 5/24/2013 8:25 AM, Saikat Kanjilal wrote:
  Anyways would love to hear thoughts and usecases that are similar from the 
  community.
  
  Your use-case sounds a lot like what loggly was doing back in 2010.
  
  http://loggly.com/videos/lucene-revolution-2010/
 
 While I was writing that, I accidentally pressed the key combination
 that told my mail client to send the message before I was done.
 
 Loggly created a new shard every five minutes, and merged older shards
 to longer time intervals.  I personally don't need this capability, but
 it is a useful pattern.  I was wondering recently whether a custom
 document router could be built for SolrCloud that automatically manages
 time-divided shards - creating, merging, and if you're not keeping the
 data forever, deleting.
 
 Thanks,
 Shawn
 
  

Re: Keeping a rolling window of indexes around solr

2013-05-24 Thread Saikat Kanjilal
This is kind of the approach used by Elasticsearch. If I'm not using SolrCloud, 
will I be able to use shard aliasing? Also, with this approach, how would 
replication work? Is it even needed?

Sent from my iPhone

On May 24, 2013, at 12:00 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Would collection aliasing help here? From Solr 4.2 release notes:
 Collection Aliasing. Got time based data? Want to re-index in a
 temporary collection and then swap it into production? Done. Stay
 tuned for Shard Aliasing.
 
 Regards,
  Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
 
 
 On Fri, May 24, 2013 at 10:25 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 Hello Solr community folks,
 I am doing some investigative work around how to roll and manage indexes 
 inside our solr configuration, to date I've come up with an architecture 
 that separates a set of masters that are focused on writes and get 
  replicated periodically and a set of slave shards strictly focused on reads, 
 additionally for each master index the design contains partial purges which 
 get performed on each of the slave shards as well as the master to keep the 
 data current.   However the architecture seems a bit more complex than I'd 
 like with a lot of moving pieces.  I was wondering if anyone has ever 
 handled/designed an architecture around a conveyor belt or rolling window 
 of indexes around n days of data and if there are best practices around 
 this.  One thing I was thinking about was whether to keep a conveyor belt 
 list of the slave shards and rotate them as needed and drop the master 
 periodically and make its backup temporarily the master.
 
 
 Anyways would love to hear thoughts and usecases that are similar from the 
 community.
 
 Regards