Re: When Index is Updated Frequently
Nearly 100 ms? If any netizen ever complained about that, I'd 'round-file' the complaint. Internal to a single process's execution, well, maybe it's an issue. Not too hard to handle. Good job to the team that made it!

From: Michael McCandless
To: solr-user@lucene.apache.org; bing...@asu.edu
Cc: Bing Li
Sent: Fri, March 4, 2011 10:45:05 AM
Subject: Re: When Index is Updated Frequently

On Fri, Mar 4, 2011 at 10:09 AM, Bing Li wrote:
> According to my experience, when the Lucene index is updated frequently, its
> performance must become low. Is that correct?

In fact Lucene can gracefully handle a high rate of updates with low-latency turnaround on the readers, using the near-real-time (NRT) API -- IndexWriter.getReader() (or, in the soon-to-be-released 3.1, IndexReader.open(IndexWriter)).

NRT is really a hybrid of "eventual consistency" and "immediate consistency", because it lets your app have full control over how quickly changes must be visible, by controlling when you pull a new NRT reader.

That said, Lucene can't offer true immediate consistency at a high update rate -- the time to open a new NRT reader is usually too costly to pay, e.g., for every search. But every 100 msec (say) is reasonable (depending on many variables...).

So... for your app you should run some tests and see. And please report back.

(But, unfortunately, NRT hasn't been exposed in Solr yet...)

-- Mike
http://blog.mikemccandless.com
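(For anyone who hasn't used the NRT API: a minimal sketch of the reopen loop Mike describes, against the Lucene 3.x-era API as I recall it; addOrUpdateDocuments() is a hypothetical helper, and the 100 ms interval is just the figure from his mail:)

    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_31, analyzer));
    IndexReader reader = writer.getReader();       // initial NRT reader
    while (running) {
        addOrUpdateDocuments(writer);              // high rate of updates
        Thread.sleep(100);                         // reopen every ~100 ms, not per search
        IndexReader newReader = reader.reopen();   // cheap if nothing changed
        if (newReader != reader) {
            reader.close();
            reader = newReader;                    // searches now see recent updates
        }
    }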
Re: GET or POST for large queries?
Probably you could do it, and solving a problem in business supersedes 'rightness' concerns, much to the dismay of geeks and 'those who like rightness and say the word "Neemph!"'. The non-rightness here is that POST, PUT, and DELETE are assumed to make changes to the URL's backend, while GET is assumed NOT to make changes. So if your POST does not make a change... it breaks convention. But if it solves the problem... :-)

Another way would be to GET with a 'query file' location, and then have the server fetch that query and execute it.

Boy!!! I'd love to see one of your queries!!! You must have a few ANDs/ORs in them :-)

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life, otherwise we all die.

From: mrw
To: solr-user@lucene.apache.org
Sent: Thu, February 17, 2011 11:27:06 AM
Subject: GET or POST for large queries?

We are running into some issues with large queries. Initially, they were ostensibly header buffer overruns, because increasing Jetty's headerBufferSize value to 65536 resolved them. This seems like a kludge, but it does solve the problem for 95% of our users.

However, we do have queries that are physically larger than that, for which increasing the headerBufferSize to 65536 does not work. This is due to security requirements: security descriptors are baked into the index, and then potentially thousands of them (depending on the user context) are passed in with each query. These oversized queries are only a problem for approximately 5% of users, who are highly entitled, but the number of security descriptors is likely to increase, and we won't have a workaround for this security policy any time soon.

After a lot of Googling, it seems common to increase the headerBufferSize, but I don't see any other strategies. Is it possible/feasible to switch to POST for querying?

Thanks!
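(If the client is SolrJ, switching the query to POST is a one-argument change; a sketch using the Solr 1.4-era client class, URL illustrative. The params travel in the request body, sidestepping Jetty's headerBufferSize entirely:)

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery(hugeQueryString); // e.g. thousands of OR'd descriptors
    QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);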
Re: My Plan to Scale Solr
What's an 'SLA'?

Dennis Gearon

From: Stijn Vanhoorelbeke
To: solr-user@lucene.apache.org; bing...@asu.edu
Sent: Thu, February 17, 2011 4:28:13 AM
Subject: Re: My Plan to Scale Solr

Hi,

I'm currently looking at SolrCloud. I've managed to set up a scalable cluster with ZooKeeper. (See the examples in http://wiki.apache.org/solr/SolrCloud for a quick understanding.) This way, all the different shards/replicas are stored in a centralised configuration. Moreover, ZooKeeper gives you out-of-the-box load balancing.

So, let's say you have 2 different shards and each is replicated 2 times. Your ZooKeeper config will look like this:

  /configs ...
  /live_nodes (v=6 children=4)
    lP_Port:7500_solr (ephemeral v=0)
    lP_Port:7574_solr (ephemeral v=0)
    lP_Port:8900_solr (ephemeral v=0)
    lP_Port:8983_solr (ephemeral v=0)
  /collections (v=20 children=1)
    collection1 (v=0 children=1) "configName=myconf"
      shards (v=0 children=2)
        shard1 (v=0 children=3)
          lP_Port:8983_solr_ (v=4) "node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/"
          lP_Port:7574_solr_ (v=1) "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/"
          lP_Port:8900_solr_ (v=1) "node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/"
        shard2 (v=0 children=2)
          lP_Port:7500_solr_ (v=0) "node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/"
          lP_Port:7574_solr_ (v=1) "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/"

--> This setup can be realised with one ZooKeeper module -- the other Solr machines just need to know the IP:port where ZooKeeper is active, and that's it.
--> So no per-node configuration/installation is needed to quickly realise a scalable, load-balanced cluster.

Disclaimer: ZooKeeper support is a relatively new feature -- I'm not sure it will work out yet in a real production environment with a tight SLA attached. But definitely keep your eyes on this stuff -- it will mature quickly!

Stijn Vanhoorelbeke
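(For reference, the wiki example Stijn points to boots a cluster along roughly these lines -- commands paraphrased from the SolrCloud wiki of that era, so check the page itself before copying; -DzkRun starts an embedded ZooKeeper on the Solr port plus 1000:)

    # first node: run embedded ZooKeeper and upload the config as 'myconf'
    java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

    # every other node: just point at the running ZooKeeper
    java -DzkHost=localhost:9983 -jar start.jar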
Re: Searching for negative numbers very slow
Is it my imagination, or has this exact email been on the list already?

Dennis Gearon

From: Chris Hostetter
To: solr-user@lucene.apache.org
Cc: yo...@lucidimagination.com
Sent: Wed, February 16, 2011 6:20:28 PM
Subject: Re: Searching for negative numbers very slow

: This was my first thought but -1 is relatively common but we have other
: numbers just as common.

I assume that when you say that, you mean "...we have other numbers (that are not negative) just as common, (but searching for them is much faster)"?

I don't have any insight into why your negative numbers are slower, but FWIW...

: Interestingly enough
:
: fq=uid:-1
: fq=foo:bar
: fq=alpha:omega
:
: is much (4x) slower than
:
: q="uid:-1 AND foo:bar AND alpha:omega"

...this is (in and of itself) not that surprising for any three arbitrary disjoint queries. When a BooleanQuery is a full conjunction like this (all clauses required), it can efficiently skip scoring a lot of documents by looping over the clauses, asking each one for the "next" doc it matches, and then leapfrogging the other clauses to that doc. In the case of the three "fq" params, each query is executed in isolation, and *all* of the matches of each are accounted for.

The speed of using distinct "fq" params in situations like this comes from the reuse after they are in the filterCache -- you can change fq=foo:bar to fq=foo:baz on the next query, and still reuse 2/3 of the work that was done on the first query. Likewise, if the next query is fq=uid:-1&fq=foo:bar&fq=alpha:beta then 2/3 of the work is already done again, and if a following query is fq=uid:-1&fq=foo:baz&fq=alpha:beta then all of the work is already done and cached, even though that particular request has never been seen by Solr.

-Hoss
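(The filterCache Hoss refers to is configured in solrconfig.xml; a typical stanza looks like this -- sizes are illustrative, not recommendations. Each entry caches the set of documents matching one fq clause:)

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="128"/>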
Re: Title index to wiki
Please show me this link -- http://wiki.apache.org/solr/TitleIndex -- on this page: http://wiki.apache.org/solr/ (where I said it would be a good idea), or on this page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (selected at random). It's one thing to know that the titles can be searched; it's another to know what the topics are that can be searched for.

Sorry if this is curt, I've worked a LOONG week.

Dennis Gearon

From: Markus Jelsma
To: solr-user@lucene.apache.org
Cc: Dennis Gearon
Sent: Fri, February 11, 2011 8:07:24 AM
Subject: Re: Title index to wiki

What do you mean? There are two links to the FrontPage on each page.

On Friday 11 February 2011 16:56:41 Dennis Gearon wrote:
> I think it would be an improvement to the wikis if the link to the title
> index were at the top of the index page of the wikis :-) I looked on that
> index page and did not see that link on that page. Who's got
> write access to the wiki pages?

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Title index to wiki
I think it would be an improvement to the wikis if the link to the title index were at the top of the index page of the wikis :-) I looked on that index page and did not see that link on that page. Who's got write access to the wiki pages?
Wiki table of contents.
Is there a detailed, perhaps alphabetical and hierarchical, table of contents for all the wikis on the Solr site?
Re: dynamic fields revisited
I have a long way to go to understand all those implications. Mind you, I never -was- whining :-). Just ignorantly surprised.

Dennis Gearon

From: Markus Jelsma
To: solr-user@lucene.apache.org
Cc: gearond
Sent: Mon, February 7, 2011 3:28:18 PM
Subject: Re: dynamic fields revisited

It would be quite annoying if it behaved as you were hoping. This way it is possible to use different field types (and analyzers) for the same field value. In faceting, for example, this can be important because you should use analyzed fields for q and fq but unanalyzed fields for facet.field. The same goes for sorting and range queries, where you can use the same field value to end up in different field types, one for sorting and one for a range query. Without the prefix or suffix of the dynamic field, one would have to statically declare the fields beforehand and lose the dynamic advantage.

> Just so anyone else can know, and save themselves half an hour if they spend four
> minutes searching.
>
> When putting a dynamic field into a document in an index, the name of the
> field RETAINS the 'constant' part of the dynamic field name.
>
> Example
> -
> If a dynamic integer field is named '*_i' in the schema.xml file,
> __and__
> you insert a field named 'my_integer_i', which matches the globbed field
> name '*_i',
> __then__
> the name of the field will be 'my_integer_i' in the index
> and in your GETs/(updating) POSTs to the index on that document, and
> __NOT__
> 'my_integer' like I was kind of hoping it would be :-(
>
> I.e., the suffix (or prefix, if you set it up that way) will NOT be
> dropped. I was hoping that everything except the globbing character, '*',
> would just be a flag to the query processor and disappear after being
> 'noticed'.
>
> Not so :-)
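(For readers following along, the schema.xml side of this looks like the following; the type names mirror the stock example schema, and this is a sketch, not a full schema:)

    <!-- any incoming field ending in _i is accepted and typed as an int;
         the document keeps the FULL name, e.g. 'my_integer_i' -->
    <dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>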
Re: Optimize searches; business is progressing with my Solr site
Hmmm, my default distance for geospatial was excluding the results, I believe. I have to check whether I was actually looking at the desired return result for 'ballroom' alone. Maybe I wasn't. But I saw a lot to learn when I applied the techniques you gave me. Thank you :-)

Dennis Gearon

From: Erick Erickson
To: solr-user@lucene.apache.org
Sent: Sun, February 6, 2011 8:21:15 AM
Subject: Re: Optimize searches; business is progressing with my Solr site

What does &debugQuery=on give you? Second, what optimizations are you doing? What shows up in the analysis page? Does your admin page show the terms you expect in your copyField?

Best
Erick

On Sun, Feb 6, 2011 at 2:03 AM, Dennis Gearon wrote:
> Thanks to LOTS of information from you guys, my site is up and working. It's
> only an API now, I need to work on my OWN front end, LOL!
>
> I have my second customer. My general-purpose repository API is very useful, I'm
> finding. I will soon be in the business of optimizing the search engine part.
>
> For example: I have a copy field that has the words 'boogie woogie ballroom' in
> lots of records. I cannot find those records using
> 'boogie/boogi/boog', or the woogie versions of those, but I can with 'ballroom'.
> For my VERY first lesson in optimization of search, what might be causing that,
> and where are the places to read about this on the Solr site?
>
> All the best on a Sunday, guys and gals.
>
> Dennis Gearon
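(The debug output Erick asks about comes back inline with the results; a request like the one below -- URL illustrative -- shows how the query was parsed and why each document scored as it did:)

    http://localhost:8983/solr/select?q=boogie&debugQuery=on&indent=true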
Optimize searches; business is progressing with my Solr site
Thanks to LOTS of information from you guys, my site is up and working. It's only an API now, I need to work on my OWN front end, LOL!

I have my second customer. My general-purpose repository API is very useful, I'm finding. I will soon be in the business of optimizing the search engine part.

For example: I have a copy field that has the words 'boogie woogie ballroom' in lots of records. I cannot find those records using 'boogie/boogi/boog', or the woogie versions of those, but I can with 'ballroom'. For my VERY first lesson in optimization of search, what might be causing that, and where are the places to read about this on the Solr site?

All the best on a Sunday, guys and gals.

Dennis Gearon
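(A hedged aside on the likely cause: Solr matches whole analyzed tokens, so 'boog' only matches 'boogie' if prefixes are indexed. One common fix is an edge n-gram field at index time -- a sketch, all names illustrative:)

    <fieldType name="text_prefix" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- indexes boo, boog, boogi, boogie for the token 'boogie' -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>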
Re: prices
That's a good idea, Yonik. So fields that aren't stored don't get displayed, and the float field in the schema never gets seen by the user. Good, I like it.

Dennis Gearon

- Original Message -
From: Yonik Seeley
To: solr-user@lucene.apache.org
Sent: Fri, February 4, 2011 10:49:42 AM
Subject: Re: prices

On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon wrote:
> Using Solr 1.4.
>
> I have a price in my schema. Currently it's a tfloat. Somewhere along the way
> from PHP, JSON, and Solr and back, extra zeroes are getting truncated, along with
> the decimal point for even dollar amounts.
>
> So I have two questions, neither of which seemed to be findable with Google.
>
> A/ Any way to keep both zeroes going into a float field? (In the analyzer, with
> XML output, the values are shown with one zero.)
> B/ Can strings be used in range queries like a float, and work well for prices?

You could do a copyField into a stored string field and use the tfloat (or tint, and store cents) for range queries, searching, etc., and the string field just for display.

-Yonik
http://lucidimagination.com
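(In schema.xml, Yonik's suggestion comes out roughly like this -- field names invented for the sketch. Note that copyField copies the *original* input text, so "10.00" survives verbatim in the string field:)

    <field name="price"         type="tfloat" indexed="true"  stored="false"/>
    <field name="price_display" type="string" indexed="false" stored="true"/>
    <copyField source="price" dest="price_display"/>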
prices
Using Solr 1.4.

I have a price in my schema. Currently it's a tfloat. Somewhere along the way from PHP, JSON, and Solr and back, extra zeroes are getting truncated, along with the decimal point for even dollar amounts.

So I have two questions, neither of which seemed to be findable with Google.

A/ Any way to keep both zeroes going into a float field? (In the analyzer, with XML output, the values are shown with one zero.)
B/ Can strings be used in range queries like a float, and work well for prices?

Dennis Gearon
Re: changing schema
Well, the nice thing is that I have an Amazon-based dev server, and it's stored as an AMI. So if I screw something up, I just throw away that server and get a fresh one, all configured and full of dev data, and BAM, back to where I was. So I'll try it again with the -rf flags. I did shut down the server, and I am using Tomcat.

Dennis Gearon

- Original Message -
From: Gora Mohanty
To: solr-user@lucene.apache.org
Sent: Thu, February 3, 2011 6:56:29 AM
Subject: Re: changing schema

On Thu, Feb 3, 2011 at 6:47 PM, Erick Erickson wrote:
> Erik:
>
> Is this a Tomcat-specific issue? Because I regularly delete just the
> data/index directory on my Windows box running Jetty without any
> problems (3_x and trunk).
>
> Mostly want to know because I just encouraged someone to delete the
> index dir based on my experience...
>
> Thanks
> Erick
>
> On Tue, Feb 1, 2011 at 12:24 PM, Erik Hatcher wrote:
>
>> The trick is, you have to remove the data/ directory, not just the
>> data/index subdirectory. And of course then restart Solr.
>>
>> Or delete *:* (with commit=true), depending on what's the best fit for your ops.
>>
>> Erik
>>
>> On Feb 1, 2011, at 11:41, Dennis Gearon wrote:
>>
>> > I tried removing the index directory once, and Tomcat refused to start up because
>> > it didn't have a segments file.
[...]

I have seen this error with Tomcat, but in my experience it has been due to doing "rm data/index/*" rather than "rm -rf data/index", or due to doing this without first shutting down Tomcat.

Regards,
Gora
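(The delete-everything route Erik mentions avoids touching the filesystem at all; via SolrJ it is just the following -- 1.4-era client class, URL illustrative:)

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");
    server.deleteByQuery("*:*"); // empty the index in place
    server.commit();             // make the deletion visible; no restart needed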
MANY thanks for help on path so far (first of 2 steps on a 1000-step path :-)
Got my API to input into both the database and the Solr instance, and to search geographically/chronologically in Solr. Next is Update and Delete. And then... and then... and then...

Dennis Gearon
Time fields
For time-of-day fields -- NOT Unix timestamps/dates -- what is the best way to do that? I can think of seconds since the beginning of the day, as integer OR string. Any other ideas? Assume that I'll be using range queries. TIA.

Dennis Gearon
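(A hedged sketch of the seconds-since-midnight approach, which suits range queries; the field name is invented. A trie int keeps the range query fast, and 8 AM to 5 PM is 8*3600=28800 through 17*3600=61200:)

    <field name="time_of_day" type="tint" indexed="true" stored="true"/>

    ...&fq=time_of_day:[28800 TO 61200]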
Re: OAI on SOLR already done?
I guess I didn't understand 'metadata'. That's why I asked the question.

Dennis Gearon

- Original Message -
From: Jonathan Rochkind
To: "solr-user@lucene.apache.org"
Sent: Wed, February 2, 2011 2:26:32 PM
Subject: Re: OAI on SOLR already done?

On 2/2/2011 5:19 PM, Dennis Gearon wrote:
> Does something like this work to extract dates, phone numbers, and addresses across
> international formats and languages?
>
> Or just in the plain ol' USA?

What are you talking about? Nothing discussed in this thread does any 'extracting' of dates, phone numbers, or addresses at all, whether in international or domestic formats.
Re: OAI on SOLR already done?
Does something like this work to extract dates, phone numbers, and addresses across international formats and languages?

Or just in the plain ol' USA?

Dennis Gearon

- Original Message -
From: Demian Katz
To: "solr-user@lucene.apache.org"
Cc: Paul Libbrecht
Sent: Wed, February 2, 2011 12:40:58 PM
Subject: RE: OAI on SOLR already done?

I already replied to the original poster off-list, but it seems worth weighing in here as well... The next release of VuFind (http://vufind.org) is going to include OAI-PMH server support. As you say, there is really no way to plug OAI-PMH directly into Solr... but a tool like VuFind can provide a fairly generic, extensible, Solr-based platform for building an OAI-PMH server. Obviously this is helpful for some use cases and not others... but I'm happy to provide more information if anyone needs it.

- Demian

From: Jonathan Rochkind [rochk...@jhu.edu]
Sent: Wednesday, February 02, 2011 3:38 PM
To: solr-user@lucene.apache.org
Cc: Paul Libbrecht
Subject: Re: OAI on SOLR already done?

The trick is that you can't just put a generic black-box OAI-PMH provider on top of any Solr index. How would it know where to get the metadata elements it needs, such as title, or last-updated date, etc.? Any given Solr index might not even have these in stored fields -- and a given app might want to look them up from somewhere other than stored fields.

If the Solr index does have them in stored fields, and you do want to get them from the stored fields, then it's, I think (famous last words), relatively straightforward code to write: a mapping from Solr stored fields to the metadata elements needed for OAI-PMH, and then simply outputting the XML template with those filled in. I am not aware of anyone who has done this as a reusable, configurable-for-your-Solr tool. You could possibly do it solely using the built-in Solr JSP/XSLT/other templating stuff I am not familiar with, rather than as an external Solr client app, or it could be an external Solr client app.

This is actually a very similar problem to something someone else asked a few days ago: "Does anyone have an OpenSearch add-on for Solr?" Very, very similar problem, just with a different XML template for the output (usually RSS or Atom) instead of OAI-PMH.

On 2/2/2011 3:14 PM, Paul Libbrecht wrote:
> Peter,
>
> I'm afraid your service is harvesting, and I am trying to look at a PMH provider service.
>
> Your project appeared early in the Google matches.
>
> paul
>
> On 2 Feb 2011, at 20:46, Péter Király wrote:
>
>> Hi,
>>
>> I don't know whether it fits your need, but we are building a tool
>> based on Drupal (the eXtensible Catalog Drupal Toolkit), which can harvest
>> with OAI-PMH and index the harvested records into Solr. The records are
>> harvested, processed, and stored in MySQL, then we index them into
>> Solr. We created some ways to manipulate the original values before
>> sending them to Solr. We built it in a modular way, so you can change
>> settings in an admin interface or write your own "hooks" (special
>> Drupal functions) to tailor the application to your needs. We support
>> only Dublin Core and our own FRBR-like schema (called the XC schema), but
>> you can add more schemas. Since this forum is about Solr, and not
>> applications using Solr, if you are interested in this tool, please write me a
>> private message, or visit http://eXtensibleCatalog.org, or the
>> module's page at http://drupal.org/project/xc.
>>
>> Hope this helps,
>>
>> Péter
>> eXtensible Catalog
>>
>> 2011/2/2 Paul Libbrecht:
>>> Hello list,
>>>
>>> I've met a few Google matches that indicate that Solr-based servers implement
>>> the Open Archives Initiative's Metadata Harvesting Protocol.
>>>
>>> Is there something made to be re-usable that would be an add-on to Solr?
>>>
>>> thanks in advance
>>>
>>> paul
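(Jonathan's "mapping plus XML template" idea, sketched in SolrJ for the case where the elements do live in stored fields. Every field name here is hypothetical, and esc() stands in for whatever XML-escaping helper you use:)

    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("last_updated:[" + from + " TO " + until + "]"); // OAI from/until args
    for (SolrDocument doc : solr.query(q).getResults()) {
        out.println("<record><metadata><oai_dc:dc>");
        out.println("  <dc:title>" + esc(doc.getFieldValue("title")) + "</dc:title>");
        out.println("  <dc:date>"  + esc(doc.getFieldValue("last_updated")) + "</dc:date>");
        out.println("</oai_dc:dc></metadata></record>");
    }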
Re: changing schema
Cool, thanks for the tip, Erik :-) There's so much to learn, and I haven't even gotten to tuning the thing for best results.

Dennis Gearon

- Original Message -
From: Erik Hatcher
To: solr-user@lucene.apache.org
Sent: Tue, February 1, 2011 9:24:24 AM
Subject: Re: changing schema

The trick is, you have to remove the data/ directory, not just the data/index subdirectory. And of course then restart Solr.

Or delete *:* (with commit=true), depending on what's the best fit for your ops.

Erik

On Feb 1, 2011, at 11:41, Dennis Gearon wrote:

> I tried removing the index directory once, and Tomcat refused to start up because
> it didn't have a segments file.
>
> - Original Message -
> From: Erick Erickson
> To: solr-user@lucene.apache.org
> Sent: Tue, February 1, 2011 5:04:51 AM
> Subject: Re: changing schema
>
> That sounds right. You can cheat and just remove data/index
> rather than delete *:*, though (you should probably do that with the Solr
> instance stopped).
>
> Make sure to remove the directory "index" as well.
>
> Best
> Erick
>
> On Tue, Feb 1, 2011 at 1:27 AM, Dennis Gearon wrote:
>
>> Anyone got a great little script for changing a schema?
[...]
Re: changing schema
I tried removing the index directory once, and Tomcat refused to start up because it didn't have a segments file.

- Original Message -
From: Erick Erickson
To: solr-user@lucene.apache.org
Sent: Tue, February 1, 2011 5:04:51 AM
Subject: Re: changing schema

That sounds right. You can cheat and just remove data/index rather than delete *:*, though (you should probably do that with the Solr instance stopped).

Make sure to remove the directory "index" as well.

Best
Erick

On Tue, Feb 1, 2011 at 1:27 AM, Dennis Gearon wrote:

> Anyone got a great little script for changing a schema?
>
> I.e., after changing:
>   the database,
>   the view in the database for data import,
>   the data-config.xml file,
>   the schema.xml file,
>
> I BELIEVE that I have to run:
>   a delete command for the whole index (*:*),
>   a full import, and an optimize.
>
> Does this all sound right?
>
> Dennis Gearon
changing schema
Anyone got a great little script for changing a schema?

I.e., after changing:
  the database,
  the view in the database for data import,
  the data-config.xml file,
  the schema.xml file,

I BELIEVE that I have to run:
  a delete command for the whole index (*:*),
  a full import, and an optimize.

Does this all sound right?

Dennis Gearon
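(Not a script, but the DataImportHandler folds the last two steps into one request -- full-import cleans the index first by default, and can optimize when it finishes; host and core in the URL are illustrative:)

    http://localhost:8983/solr/dataimport?command=full-import&clean=true&optimize=true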
first search on index
So, is it normal for the first search against a freshly made index to return nothing?

Dennis Gearon
field names for solr spatial
I would love it if I could use 'latitude' and 'longitude' in all places. But it seems that the Solr spatial plugin for 1.4 only works with lat/lng. Any way to change that?

Dennis Gearon
Re: get SOMETHING out of an index
Well, this is the query that USED to work, before we massaged the schema (*I* did):

    solr/select?wt=json&indent=true&start=0&rows=20&q={!spatial lat=37.221293 long=-121.979192 radius=1000 unit=km threadCount=3} *:*

WHOOPS!!! Just for fun, after spending HOURS screwing around with exceptions -- after following some bad directions on the web to just delete the index directory before doing a new data import -- I tried the query above, and now it works. I don't know enough to know why. To get it working, I copied an index directory from another instance with an incorrect schema, issued a delete-all command (*:*), then did the data import and optimize, and voila! Along the way, I had to change the owner and group of the replaced ../index directory and files back to tomcat6.

I THINK that I had one of the 'lng' fields in one of the three config files of interest as 'long'. I'll ask some questions about that in the next email.

Dennis Gearon

- Original Message -
From: Estrada Groups
To: "solr-user@lucene.apache.org"
Sent: Sat, January 29, 2011 9:35:56 PM
Subject: Re: get SOMETHING out of an index

It would be really helpful to send along your schema.xml file so we can see how you are indexing these points. Polygons and linestrings are not supported yet. Another good way to test is using the Solr admin tool, or hand-jamming your params in manually. Type *:* as your query in the admin tool and see what it returns. It should return all indexed fields and their values.

Keep in mind that your radius search has to be done on a field of type solr.LatLonType, so check out the field called 'store' in the example config file. From there you can start to build out the rest of your queries, starting with {!type=geofilt}. I have example code that I can send along tomorrow.

For the Solr/Lucene contributors out there, what was the point of storing lats and longs in individual fields if they can't really be used for anything? If they can, please gimme an example that uses the solr.PointType type.

Adam

On Jan 29, 2011, at 11:09 PM, Dennis Gearon wrote:

> I indexed my whole database (only 52k records).
>
> It has some geospatial data in it. I set the geospatial search to a 1000 km radius
> centered on the town where they all are, and NADA comes out.
>
> How can I find out what's in the index and get at least ONE document out?
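(For later readers: the geofilt syntax Adam alludes to -- available on trunk/3.x-era Solr rather than the 1.4 plugin -- looks roughly like this, with 'store' being the example schema's LatLonType field; the point and distance are illustrative:)

    q=*:*&fq={!geofilt sfield=store pt=37.221293,-121.979192 d=1000}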
get SOMETHING out of an index
I indexed my whole database (only 52k records).

It has some geospatial data in it. I set the geospatial search to a 1000 km radius centered on the town where they all are, and NADA comes out.

How can I find out what's in the index and get at least ONE document out?

Dennis Gearon
Re: match count per shard and across shards
Sounds like the interface level to achieve this is multiple indexes.

Dennis Gearon

- Original Message -
From: Upayavira
To: solr-user@lucene.apache.org
Sent: Sat, January 29, 2011 3:51:45 PM
Subject: Re: match count per shard and across shards

To my knowledge, the distributed search functionality is intended to be transparent; thus no details deriving from it are exposed (e.g. which docs come from which shard), so, no, I don't believe it to be possible.

The only way I know right now that you could achieve it is with two (sets of) queries. One would be a distributed search across all shards, and the other would be a single hit to every shard. To fake such a facet, this second set of queries would only need to ask for totals, so it could use rows=0. Otherwise you'd have to enhance the distributed search code to expose some of this information in its response.

Upayavira

On Sat, 29 Jan 2011 03:48 -0800, "csj" wrote:
>
> Hi,
>
> Is it possible to construct a Solr query that will return the total number
> of hits across all shards, and at the same time get the number of
> hits per shard?
>
> I was thinking along the lines of a faceted search, but I'm not deep enough
> into Solr's capabilities and query parameters to figure it out.
>
> Regards,
>
> Christian Sonne Jensen

---
Enterprise Search Consultant at Sourcesense UK,
Making Sense of Open Source
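(Concretely, Upayavira's two-step approach comes out as something like the following; hosts and shard addresses are illustrative. rows=0 returns just numFound, so the per-shard requests are cheap:)

    # one distributed query for the cross-shard total
    http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr&rows=0

    # one plain query per shard for the per-shard counts
    http://host1:8983/solr/select?q=foo&rows=0
    http://host2:8983/solr/select?q=foo&rows=0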
Thoughts on USING dynamic fields for extending objects
Well, mid next month we're going to start using dynamic fields as they relate to our business rules. Basically, it involves having a basic set of objects in code/database, flattened for search in Solr. The MAIN business object is to be extendable by the customer, while still having to supply the required fields in the base object. We will use dynamic fields of defined types.

I had a question for those more experienced than I. We are thinking about two possible usage patterns:

A/ Users can add any field they want, as long as they use the right suffix for the field. Changing the schema can be done at will, and updating past objects is totally on the user. They get:
  1/ Find within the field.
  2/ Range queries.
  3/ Other future single-field functionality later.

B/ Users can NOT add any field they want; they must submit a schema, hopefully automated. The data still goes into the Solr index as dynamically accepted fields, as long as they use the right suffix for the field. Changing the schema is done by submitting the new schema. Updating past objects is STILL totally on the user. They get:
  1/ Find within the field.
  2/ Range queries.
  3/ Various filter functions like mandatory fields, acceptable ranges, minimum lengths on strings, and other processing.
  4/ Other future single-field functionality later.
  5/ The ability to make their own copyFields for 'grouping' of their own fields.

'A' I see as simplest to administer, but possibly has security holes? THAT's my main question; all thoughts welcome. 'B' is better as a value-added service, but means a LOT more work on our site's end, I believe. We could also possibly reject sensitive field names for security?

Any thoughts much appreciated.

Dennis Gearon
Re: Solr for noSQL
Personally, I just create a view that flattens out the database and renames the fields as I desire. Then I call the view with the DIH to import it. Solr doesn't know anything about the database, except how to get a connection and fetch rows. And that's pretty darn useful -- just that much less code to write.

Dennis Gearon

- Original Message -
From: Upayavira
To: solr-user@lucene.apache.org
Sent: Fri, January 28, 2011 1:41:42 AM
Subject: Re: Solr for noSQL

On Thu, 27 Jan 2011 21:38 -0800, "Dennis Gearon" wrote:
> Why not make one's own DIH handler, Lance?

Personally, I don't like that approach. Solr is best related to as something of a black box that you configure, then push content to. Having Solr know about your data sources, and pull content in, seems to me to be mixing concerns.

I relate to the DIH as a useful tool for smaller sites or for prototyping, but would expect anything more substantial to require an indexing application that gives you full control over the indexing process. It could be a lightweight app that uses a MongoDB Java client and SolrJ, and simply pulls from one and pushes to the other. If you don't want to run another JVM, it could run as a separate webapp within your Solr JVM.

From an architectural point of view, do you configure MySQL, or MongoDB for that matter, to pull content into itself? Likewise, Solr should be a service that listens, waiting to be given data.

Upayavira
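(With the view-plus-DIH pattern, the data-config.xml stays tiny because the view already did the flattening and renaming; all names below are illustrative:)

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb"
                  user="solr" password="..."/>
      <document>
        <!-- the view's column names already match the schema's field names -->
        <entity name="doc" query="SELECT * FROM solr_flat_view"/>
      </document>
    </dataConfig>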
Re: Does solr support indexing of files other than UTF-8
Use the ICONV library in your server-side language. Convert the text to UTF-8, store it with a field describing what encoding it came in with, and re-encode it on the way out if you wish.

Dennis Gearon

- Original Message -
From: prasad deshpande
To: solr-user@lucene.apache.org
Sent: Fri, January 28, 2011 12:41:29 AM
Subject: Re: Does solr support indexing of files other than UTF-8

Thanks, Paul. However, I want to support indexing of files in local encodings. How would I achieve that?

On Thu, Jan 27, 2011 at 2:46 PM, Paul Libbrecht wrote:

> At least in Java, UTF-8 transcoding is done on a stream basis. No issue there.
>
> paul
>
> On 27 Jan 2011, at 09:51, prasad deshpande wrote:
>
> > The docs can be huge; suppose there is an 800MB PDF file to index --
> > I need to translate it to UTF-8 and then send the file for indexing. Now
> > suppose there can be any number of clients who can upload files; at that
> > point it will affect performance. And our product already supports
> > localization with local encodings.
> >
> > Thanks,
> > Prasad
> >
> > On Thu, Jan 27, 2011 at 2:04 PM, Paul Libbrecht wrote:
> >
> >> Why is converting documents to UTF-8 not feasible?
> >> Nowadays any platform offers such services.
> >>
> >> Can you give a detailed failure description (maybe with the URL to a sample
> >> document you post)?
> >>
> >> paul
> >>
> >> On 27 Jan 2011, at 07:31, prasad deshpande wrote:
> >>> I am able to successfully index/search non-English data (like Hebrew and
> >>> Japanese) encoded in UTF-8.
> >>> However, when I tried to index data encoded in a local encoding, like
> >>> Big5, I could not see the desired results.
> >>> The contents looked garbled for the Big5-encoded document when I
> >>> searched for all indexed documents.
> >>>
> >>> Converting a complete document to UTF-8 is not feasible.
> >>> I am not very clear on how Solr supports localization with encodings
> >>> other than UTF-8.
> >>>
> >>> I verified the links below:
> >>> 1. http://lucene.apache.org/java/3_0_3/api/all/index.html
> >>> 2. http://wiki.apache.org/solr/LanguageAnalysis
> >>>
> >>> Thanks and Regards,
> >>> Prasad
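(Paul's point about stream-based transcoding, sketched in Java -- the iconv equivalent Dennis mentions; file name and charset are illustrative. The fixed-size buffer means even an 800MB file transcodes in constant memory:)

    Reader in  = new BufferedReader(new InputStreamReader(
                     new FileInputStream("doc.txt"), "Big5"));
    Writer out = new OutputStreamWriter(
                     new FileOutputStream("doc-utf8.txt"), "UTF-8");
    char[] buf = new char[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
        out.write(buf, 0, n);   // decode local encoding, re-encode as UTF-8
    }
    in.close();
    out.close();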
Re: Solr for noSQL
Why not make one's own DIH handler, Lance?

Dennis Gearon

- Original Message -
From: Lance Norskog
To: solr-user@lucene.apache.org
Sent: Thu, January 27, 2011 9:33:25 PM
Subject: Re: Solr for noSQL

There are no special connectors available to read from key-value stores like memcache/Cassandra/MongoDB. You would have to get a Java client library for the DB and code your own DataImportHandler data source. I cannot recommend this; you should make your own program that reads the data and uploads it to Solr with one of the Solr client libraries.

Lance

On 1/27/11, Jianbin Dai wrote:
> Hi,
>
> Do we have a data import handler to quickly read in data from a noSQL
> database -- specifically MongoDB, which I am thinking of using?
>
> Or, a more general question: how does Solr work with noSQL databases?
>
> Thanks.
>
> Jianbin
Re: How to group result when search on multiple fields
This is probably either 'shingling' or 'facets'. Someone more experienced can verify that or add more details.

Dennis Gearon

- Original Message -
From: cyang2010
To: solr-user@lucene.apache.org
Sent: Wed, January 26, 2011 3:35:47 PM
Subject: How to group result when search on multiple fields

Let me give an example to illustrate my question: on the Netflix site, the search box allows you to search by movie, TV show, actor, director, and genre. If "Tomcat" is searched, it gives results as: movie titles with "Tomcat" or whatever, and, somewhere in between, it also shows two actors, "Tom Cruise" and "Tom Hanks". Then follow a lot of other movie titles.

If this is all based on the same type of index document (titles that have a title name, associated actors, directors, and genres), then the search results are all titles. How is it able to render matching actors as part of the result? In other words, how does it tell that some movies are returned because of an actor match?

If it is implemented as two different types of index document -- one document type for titles (name, actors, directors...) and the other for actors (actor name, movie/TV titles) -- how does it merge the results? As far as I can tell, the actor names can appear anywhere in the search results, as a group. Is it just comparing the score of the first actor document with the scores of the title matches, and then deciding where to insert the actor match results? Well, that can be inaccurate, right? Scores from two different types of document are not comparable, right?

Let me know your thoughts on this. Thanks in advance.
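(One common approach, for what it's worth: index both document types in one schema with a discriminator field, so a single query returns titles and actors together and a facet reports how many of each matched -- a hedged sketch, all field names invented:)

    q=tomcat&defType=dismax&qf=title_name actor_name
      &facet=true&facet.field=doc_type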
Re: in-index representation of tokens
I am saying: is there a list of tokens that have been parsed (a table of them) for each column? Or one for the whole index?

Dennis Gearon

- Original Message -
From: Jonathan Rochkind
To: "solr-user@lucene.apache.org"
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representation of tokens

Why does it matter? You can't really get at them unless you store them. I don't know what "table per column" means; there's nothing in the Solr architecture called a "table" or a "column". (Although by "column" you probably mean, more or less, a Solr "field".) There is nothing like a "table" in Solr. Solr is still not an RDBMS.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
> So, the index is a list of tokens per column, right?
>
> There's a table per column that lists the analyzed tokens?
>
> And the tokens per column are represented as what -- system integers? 32/64-bit
> unsigned ints?
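(For the curious, the usual simplified mental model of a Lucene segment -- from general Lucene documentation rather than this thread: one term dictionary whose entries are (field, term text) pairs, each pointing at a postings list of internal int document ids:)

    (field, term)          postings (internal docids)
    (title, "ballroom") -> 3, 9
    (title, "boogie")   -> 3, 17, 42
    (desc,  "boogie")   -> 8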
in-index representation of tokens
So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what -- system integers? 32/64-bit unsigned ints?

Dennis Gearon
Re: DIH serialize
Depends on your process chain to the eventual viewer/consumer of the data. The questions to ask are:

A/ Is the data IN Solr going to be viewed or processed in its original form?
--> set stored="true"
--> no serialization needed.

B/ If it's going to be analyzed and searched for separately from any other field, the analysis will put it into an unreadable form. If you need to see it, then
--> set indexed="true" and stored="true"
--> no serialization needed.

C/ If it's NOT going to be viewed AS IS, and it's not going to be searched for AS IS (i.e., other columns will be how the data is found), and you have another, serializable format:
--> set indexed="false" and stored="true"
--> serialize AS PER THE INTENDED APPLICATION; not sure that Solr can do that at all.

D/ If it's NOT going to be viewed AS IS, BUT it IS going to be searched for AS IS (this column will be how the data is found), and you have another, serializable format, you need to put it into TWO columns:
--> a SERIALIZED field: set indexed="false" and stored="true"
--> an UNSERIALIZED field: set indexed="true" and stored="false"
--> serialize AS PER THE INTENDED APPLICATION; not sure that Solr can do that at all.

Hope that helps!

Dennis Gearon

- Original Message -
From: Papp Richard
To: solr-user@lucene.apache.org
Sent: Sun, January 23, 2011 2:02:05 PM
Subject: DIH serialize

Hi all,

I wasted the last few hours trying to serialize some column values (from MySQL) into a Solr column, but I just can't find such a function. I'll use the value in PHP -- I don't know if it is possible to serialize in PHP style at all. This is what I tried, and it works with a given factor:

in schema.xml:
  . . .

in the DIH xml:
  <![CDATA[
    function my_serialize(row) {
      row.put('main_timetable', row.toString());
      return row;
    }
  ]]>
  . . .

> Can I use java directly in script (
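(The wiring around that function, for anyone reconstructing it -- DIH's ScriptTransformer runs JavaScript on Rhino; the function name and field mirror the post, while the query and output format are invented for the sketch:)

    <script><![CDATA[
      function my_serialize(row) {
        // DIH has no built-in PHP-style serialize(); emit your own
        // string format (e.g. JSON) that PHP can parse back out
        row.put('main_timetable', row.get('open') + '|' + row.get('close'));
        return row;
      }
    ]]></script>

    <entity name="shop" transformer="script:my_serialize"
            query="SELECT open, close FROM timetable"/>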
one last questoni on dynamic fields
Is it possible to use ONE definition of a dynamic field type for inserting multiple dynamic fields of that type with different names? Or do I need a separate dynamic field definition for each eventual field?

Can I do this?
  . . .
and then, on insert, supply all their values:
  9802490824908 9809084 09845970011 09874523459870

Dennis Gearon
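(A hedged reconstruction of the schema and update XML the question implies -- the answer is yes, one pattern serves any number of field names. A string type is used here, since the sample values overflow an int and carry leading zeroes; all field names are invented:)

    <!-- schema.xml: ONE definition -->
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>

    <!-- update XML: MANY fields matching it, each keeping its full name -->
    <doc>
      <field name="phone_home_s">9802490824908</field>
      <field name="phone_work_s">9809084</field>
      <field name="acct_a_s">09845970011</field>
      <field name="acct_b_s">09874523459870</field>
    </doc>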
Re: filter update by IP
Most times people do this by running Solr ONLY on localhost, and running some kind of permission scheme through a server-side application.

Dennis Gearon

- Original Message -
From: Erik Hatcher
To: solr-user@lucene.apache.org
Sent: Sun, January 23, 2011 10:47:02 AM
Subject: Re: filter update by IP

No. SolrQueryRequest doesn't (currently) have access to the actual HTTP request coming in. You'll need to do this either with a servlet filter registered in web.xml, or restrict it with some other external firewall-ish technology.

Erik

On Jan 23, 2011, at 13:21, Teebo wrote:

> Hi
>
> I would like to restrict access to the /update/csv request handler.
>
> Is there a ready-to-use UpdateRequestProcessor for that?
>
> My first idea was to inherit from CSVRequestHandler and to override
>
>   public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
>     ...
>     restrict-by-IP code
>     ...
>     super.handleRequest(req, rsp);
>   }
>
> What do you think?
>
> Regards,
> t.
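(A minimal sketch of the servlet-filter route Erik suggests -- map it to /update/* in web.xml; the allowed address is illustrative:)

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;

    public class UpdateIpFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse rsp, FilterChain chain)
                throws IOException, ServletException {
            if ("127.0.0.1".equals(req.getRemoteAddr())) {
                chain.doFilter(req, rsp);                   // local caller: allow
            } else {
                ((HttpServletResponse) rsp).sendError(403); // everyone else: forbidden
            }
        }
        public void init(FilterConfig config) {}
        public void destroy() {}
    }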
Re: api key filtering
Totally agree: do it at indexing time, in the index.

Dennis Gearon

- Original Message -
From: Jonathan Rochkind
To: "solr-user@lucene.apache.org"
Sent: Sat, January 22, 2011 5:28:50 PM
Subject: RE: api key filtering

If you COULD solve your problem by indexing 'public', or other tokens from a limited vocabulary of document roles, in a field -- then I'd definitely suggest you look into doing that, rather than doing odd things with Solr instead. If the only barrier is not currently having sufficient logic at the indexing stage to do that, then it is going to end up being a lot less of a headache in the long term to simply add a layer at the indexing stage, than to try to get Solr to do things outside of, well, its comfort zone.

Of course, depending on your requirements, it might not be possible to do that; maybe you can't express the semantics in terms of a limited set of roles applied to documents. And then maybe your best option really is sending an up-to-2k-element list (not exactly the same list every time, presumably) of acceptable documents to Solr with every query, and maybe you can get that to work reasonably. Depending on how many different complete lists of documents you have, maybe there's a way to use Solr caches effectively in that situation, or maybe that's not even necessary, since lookup by unique id should be pretty quick anyway; not really sure. Otherwise you'd have to enhance things yourself.

But if the semantics are possible, it is much better to work with Solr than against it; it's going to take a lot less tinkering to get Solr to perform well if you can just send an fq=role:public or something, instead of a list of document IDs. You won't need to worry about it, it'll just work, because you know you're having Solr do what it's built to do. Totally worth a bit of work to add a logic layer at the indexing stage, IMO.

From: Erick Erickson [erickerick...@gmail.com]
Sent: Saturday, January 22, 2011 4:50 PM
To: solr-user@lucene.apache.org
Subject: Re: api key filtering

1024 is the default number; it can be increased. See maxBooleanClauses in solrconfig.xml.

This shouldn't be a problem with 2k clauses, but expanding it to tens of thousands is probably a mistake (but test to be sure).

Best
Erick

On Sat, Jan 22, 2011 at 3:50 PM, Matt Mitchell wrote:

> Hey, thanks, I'll definitely have a read. The only problem with this, though,
> is that our api is a thin layer of app code, with Solr only (no db); we
> index data from our SQL db into Solr, and push the index off for consumption.
>
> The only other idea I had was to send a list of the allowed document ids
> along with every Solr query, but then I'm sure I'd run into a filter query
> limit. Each key could be associated with up to 2k documents, so that's 2k
> values in an fq, which would probably be too many for Lucene (I think its
> limit is 1024).
>
> Matt
>
> On Sat, Jan 22, 2011 at 3:40 PM, Dennis Gearon wrote:
>
> > The only way that you would have that many api keys per record is if one of
> > them represented 'public', right? 'public' is a ROLE. Your answer is to use
> > RBAC-style techniques.
> >
> > Here are some links that I have on the subject. Sorry for the formatting,
> > Firefox is freaking out. I cut and pasted these from an email in my sent box.
> > I hope the links came out.
> >
> > Part 1
> > http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/
> >
> > Part 2
> > Role-based access control in SQL, part 2 at Xaprb
> >
> > ACL/RBAC Bookmarks ALL
> >
> > UserRbac - symfony - Trac
> > A Role-Based Access Control (RBAC) system for PHP
> > Appendix C: Task-Field Access
> > Role-based access control in SQL, part 2 at Xaprb
> > PHP Access Control - PHP5 CMS Framework Development | PHP Zone
> > Linux file and directory permissions
> > MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root Password
> > per RECORD/Entity permissions? - symfony users | Google Groups
> > Special Topics: Authentication and Authorization | The Definitive Guide to Yii |
> > Yii Framework
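(What "do it in the index" looks like in practice -- a multi-valued roles field written at index time, then one short fq per request; all names are illustrative:)

    <field name="roles" type="string" indexed="true" stored="false" multiValued="true"/>

    # anonymous API key:
    ...&fq=roles:public
    # a key entitled to two roles:
    ...&fq=roles:(public OR premium)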
I cut and pasted these > from > > an > > email from my sent box. I hope the links came out. > > > > > > Part 1 > > > > > > >http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/ >/ > > > > > > Part2 > > Role-based access control in SQL, part 2 at Xaprb > > > > > > > > > > > > ACL/RBAC Bookmarks ALL > > > > UserRbac - symfony - Trac > > A Role-Based Access Control (RBAC) system for PHP > > Appendix C: Task-Field Access > > Role-based access control in SQL, part 2 at Xaprb > > PHP Access Control - PHP5 CMS Framework Development | PHP Zone > > Linux file and directory permissions > > MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root > > Password > > per RECORD/Entity permissions? - symfony users | Google Groups > > Special Topics: Authentication and Authorization | The Definitive Guide > to > > Yii | > > Yii Framework >
Re: api key filtering
Got it, here are the links that I have on RBAC/ACL/Access Control. Some of these are specific to Solr.

http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/
http://www.xaprb.com/blog/2006/08/18/role-based-access-control-in-sql-part-2/
http://php.dzone.com/articles/php-access-control?page=0,1
http://www.tonymarston.net/php-mysql/role-based-access-control.html
http://www.tonymarston.net/php-mysql/menuguide/appendixc.html
http://trac.symfony-project.org/wiki/UserRbac
http://code.google.com/p/kohana-mptt/source/browse/trunk/acl/libraries/Acl.php?r=82
http://www.oracle.com/technetwork/articles/javaee/ajax-135201.html
http://phpgacl.sourceforge.net/
http://www.java2s.com/Code/Java/GWT/ClassthatactsasaclienttoaJSONservice.htm
http://dev.w3.org/perl/modules/W3C/Rnodes/bin/makeAclTables.sql
http://dev.juokaz.com/
http://stackoverflow.com/questions/54230/cakephp-acl-database-setup-aro-aco-structure
http://blog.reardonsoftware.com/2010/07/spring-security-acl-schema-for-oracle.html
http://www.mail-archive.com/symfony-users@googlegroups.com/msg29537.html
http://www.schemaweb.info/schema/SchemaInfo.aspx?id=167
http://www.assembla.com/code/backendpro/subversion/nodes/trunk/modules/auth/libraries/Khacl.php?rev=169
http://framework.zend.com/wiki/display/ZFUSER/Using+Zend_Acl+with+a+database+backend
http://www.w3.org/2001/04/20-ACLs#Structure
http://lucene.472066.n3.nabble.com/Modelling-Access-Control-td1756817.html#a1759372
http://jmcneese.wordpress.com/2009/04/05/row-level-model-access-control-for-cakephp/
http://jmcneese.wordpress.com/2009/04/05/row-level-model-access-control-for-cakephp/#comment-112
https://issues.apache.org/jira/browse/SOLR-1834
http://www.yiiframework.com/doc/guide/1.1/en/topics.auth#role-based-access-control
http://www.yiiframework.com/doc/guide/topics.auth#role-based-access-control

- Original Message -
From: Dennis Gearon
To: solr-user@lucene.apache.org
Sent: Sat, January 22, 2011 1:22:04 PM
Subject: Re: api key filtering

Dang! There were hot, clickable links in the web mail I put them in. I guess you guys can search for those strings on Google and find them. Sorry.
- Original Message From: Dennis Gearon To: solr-user@lucene.apache.org Sent: Sat, January 22, 2011 1:09:26 PM Subject: Re: api key filtering The links didn't work, so here they are again, NOT from a sent folder: PHP Access Control - PHP5 CMS Framework Development | PHP Zone A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb PHP Access Control - PHP5 CMS Framework Development | PHP Zone UserRbac - symfony - Trac A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb UserRbac - symfony - Trac Acl.php - kohana-mptt - Project Hosting on Google Code CANDIDATE-PHP Generic Access Control Lists http://dev.w3.org/perl/modules/W3C/Rnodes/bin/makeAclTables.sql makeAclTables.sql php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow PHP Generic Access Control Lists Reardon's Ruminations: Spring Security ACL Schema for Oracle Re: [symfony-users] Implementing an existing ACL API in symfony SchemaWeb - Classes And Properties - ACL Schema trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla Using Zend_Acl with a database backend - Zend Framework Wiki W3C ACL System Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Matt Mitchell To: solr-user@lucene.apache.org Sent: Sat, January 22, 2011 12:50:24 PM Subject: Re: api key filtering
Re: api key filtering
Dang! There were hot, clickable links in the web mail I put them in. I guess you guys can search for those strings on google and find them. Sorry. - Original Message From: Dennis Gearon To: solr-user@lucene.apache.org Sent: Sat, January 22, 2011 1:09:26 PM Subject: Re: api key filtering The links didn't work, so here they are again, NOT from a sent folder: PHP Access Control - PHP5 CMS Framework Development | PHP Zone A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb PHP Access Control - PHP5 CMS Framework Development | PHP Zone UserRbac - symfony - Trac A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb UserRbac - symfony - Trac Acl.php - kohana-mptt - Project Hosting on Google Code CANDIDATE-PHP Generic Access Control Lists http://dev.w3.org/perl/modules/W3C/Rnodes/bin/makeAclTables.sql makeAclTables.sql php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow PHP Generic Access Control Lists Reardon's Ruminations: Spring Security ACL Schema for Oracle Re: [symfony-users] Implementing an existing ACL API in symfony SchemaWeb - Classes And Properties - ACL Schema trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla Using Zend_Acl with a database backend - Zend Framework Wiki W3C ACL System Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Matt Mitchell To: solr-user@lucene.apache.org Sent: Sat, January 22, 2011 12:50:24 PM Subject: Re: api key filtering Hey thanks I'll definitely have a read. The only problem with this though, is that our api is a thin layer of app-code, with solr only (no db), we index data from our sql db into solr, and push the index off for consumption. The only other idea I had was to send a list of the allowed document ids along with every solr query, but then I'm sure I'd run into a filter query limit. Each key could be associated with up to 2k documents, so that's 2k values in an fq which would probably be too many for lucene (I think its limit 1024). Matt On Sat, Jan 22, 2011 at 3:40 PM, Dennis Gearon wrote: > The only way that you would have that many api keys per record, is if one > of > them represented 'public', right? 'public' is a ROLE. Your answer is to use > RBAC > style techniques. > > > Here are some links that I have on the subject. What I'm thinking of doing > is: > Sorry for formatting, Firefox is freaking out. I cut and pasted these from > an > email from my sent box. I hope the links came out. > > > Part 1 > http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/ > > Part 2 > Role-based access control in SQL, part 2 at Xaprb > > > > > > ACL/RBAC Bookmarks ALL > > UserRbac - symfony - Trac > A Role-Based Access Control (RBAC) system for PHP > Appendix C: Task-Field Access > Role-based access control in SQL, part 2 at Xaprb > PHP Access Control - PHP5 CMS Framework Development | PHP Zone > Linux file and directory permissions > MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root > Password > per RECORD/Entity permissions?
- symfony users | Google Groups > Special Topics: Authentication and Authorization | The Definitive Guide to > Yii | > Yii Framework > > att.net Mail (gear...@sbcglobal.net) > Solr - User - Modelling Access Control > PHP Generic Access Control Lists > Row-level Model Access Control for CakePHP « some flot, some jet > Row-level Model Access Control for CakePHP « some flot, some jet > Yahoo! GeoCities: Get a web site with easy-to-use site building tools. > Class that acts as a client to a JSON service : JSON « GWT « Java > Juozas Kaziukėnas devBlog > Re: [symfony-users] Implementing an existing ACL API in symfony > php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow > W3C ACL System > makeAclTables.sql > SchemaWeb - Classes And Properties - ACL Schema > Reardon's Ruminations: Spring Security ACL Schema for Oracle > trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla > Acl.php - kohana-mptt - Project Hosting on Google Code > Asynchronous JavaScript Technology and XML (Ajax) With the Java Platform > The page cannot be found > > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself.
Re: api key filtering
The links didn't work, so here they are again, NOT from a sent folder: PHP Access Control - PHP5 CMS Framework Development | PHP Zone A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb PHP Access Control - PHP5 CMS Framework Development | PHP Zone UserRbac - symfony - Trac A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb UserRbac - symfony - Trac Acl.php - kohana-mptt - Project Hosting on Google Code CANDIDATE-PHP Generic Access Control Lists http://dev.w3.org/perl/modules/W3C/Rnodes/bin/makeAclTables.sql makeAclTables.sql php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow PHP Generic Access Control Lists Reardon's Ruminations: Spring Security ACL Schema for Oracle Re: [symfony-users] Implementing an existing ACL API in symfony SchemaWeb - Classes And Properties - ACL Schema trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla Using Zend_Acl with a database backend - Zend Framework Wiki W3C ACL System Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Matt Mitchell To: solr-user@lucene.apache.org Sent: Sat, January 22, 2011 12:50:24 PM Subject: Re: api key filtering Hey thanks I'll definitely have a read. The only problem with this though, is that our api is a thin layer of app-code, with solr only (no db), we index data from our sql db into solr, and push the index off for consumption. The only other idea I had was to send a list of the allowed document ids along with every solr query, but then I'm sure I'd run into a filter query limit. Each key could be associated with up to 2k documents, so that's 2k values in an fq which would probably be too many for lucene (I think its limit 1024). Matt On Sat, Jan 22, 2011 at 3:40 PM, Dennis Gearon wrote: > The only way that you would have that many api keys per record, is if one > of > them represented 'public', right? 'public' is a ROLE. Your answer is to use > RBAC > style techniques. > > > Here are some links that I have on the subject. What I'm thinking of doing > is: > Sorry for formatting, Firefox is freaking out. I cut and pasted these from > an > email from my sent box. I hope the links came out. > > > Part 1 > http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/ > > Part 2 > Role-based access control in SQL, part 2 at Xaprb > > > > > > ACL/RBAC Bookmarks ALL > > UserRbac - symfony - Trac > A Role-Based Access Control (RBAC) system for PHP > Appendix C: Task-Field Access > Role-based access control in SQL, part 2 at Xaprb > PHP Access Control - PHP5 CMS Framework Development | PHP Zone > Linux file and directory permissions > MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root > Password > per RECORD/Entity permissions? - symfony users | Google Groups > Special Topics: Authentication and Authorization | The Definitive Guide to > Yii | > Yii Framework > > att.net Mail (gear...@sbcglobal.net) > Solr - User - Modelling Access Control > PHP Generic Access Control Lists > Row-level Model Access Control for CakePHP « some flot, some jet > Row-level Model Access Control for CakePHP « some flot, some jet > Yahoo!
GeoCities: Get a web site with easy-to-use site building tools. > Class that acts as a client to a JSON service : JSON « GWT « Java > Juozas Kaziukėnas devBlog > Re: [symfony-users] Implementing an existing ACL API in symfony > php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow > W3C ACL System > makeAclTables.sql > SchemaWeb - Classes And Properties - ACL Schema > Reardon's Ruminations: Spring Security ACL Schema for Oracle > trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla > Acl.php - kohana-mptt - Project Hosting on Google Code > Asynchronous JavaScript Technology and XML (Ajax) With the Java Platform > The page cannot be found > > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a > better > idea to learn from others’ mistakes, so you do not have to make them > yourself. > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. > > > > - Original Message
Re: api key filtering
The only way that you would have that many api keys per record, is if one of them represented 'public', right? 'public' is a ROLE. Your answer is to use RBAC style techniques. Here are some links that I have on the subject. What I'm thinking of doing is: Sorry for formatting, Firefox is freaking out. I cut and pasted these from an email from my sent box. I hope the links came out. Part 1 http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/ Part2 Role-based access control in SQL, part 2 at Xaprb ACL/RBAC Bookmarks ALL UserRbac - symfony - Trac A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb PHP Access Control - PHP5 CMS Framework Development | PHP Zone Linux file and directory permissions MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root Password per RECORD/Entity permissions? - symfony users | Google Groups Special Topics: Authentication and Authorization | The Definitive Guide to Yii | Yii Framework att.net Mail (gear...@sbcglobal.net) Solr - User - Modelling Access Control PHP Generic Access Control Lists Row-level Model Access Control for CakePHP « some flot, some jet Row-level Model Access Control for CakePHP « some flot, some jet Yahoo! GeoCities: Get a web site with easy-to-use site building tools. Class that acts as a client to a JSON service : JSON « GWT « Java Juozas Kaziukėnas devBlog Re: [symfony-users] Implementing an existing ACL API in symfony php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow W3C ACL System makeAclTables.sql SchemaWeb - Classes And Properties - ACL Schema Reardon's Ruminations: Spring Security ACL Schema for Oracle trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla Acl.php - kohana-mptt - Project Hosting on Google Code Asynchronous JavaScript Technology and XML (Ajax) With the Java Platform The page cannot be found Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Matt Mitchell To: solr-user@lucene.apache.org Sent: Sat, January 22, 2011 11:48:22 AM Subject: api key filtering Just wanted to see if others are handling this in some special way, but I think this is pretty simple. We have a database of api keys that map to "allowed" db records. I'm planning on indexing the db records into solr, along with their api keys in an indexed, non-stored, multi-valued field. Then, to query for docs that belong to a particular api key, they'll be queried using a filter query on api_key. The only concern of mine is that, what if we end up with 100k api_keys? Would it be a problem to have 100k non-stored keys in each document? We have about 500k documents total. Matt
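A minimal sketch of the approach Matt describes — an indexed, non-stored, multi-valued key field plus a filter query (the field name api_key is his; the key value is illustrative):

  <!-- schema.xml -->
  <field name="api_key" type="string" indexed="true" stored="false" multiValued="true"/>

  http://localhost:8983/solr/select?q=*:*&fq=api_key:abc123

Because a simple fq like this is cached in Solr's filter cache, repeated queries for the same key stay cheap; the 1024 limit Matt worries about (maxBooleanClauses) only bites when a single query ORs together thousands of values.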
Re: Integrating Surround Query Parser
Sounds to me like you either have to find a way to use a parser that is NOT a child class of org.apache.solr.search.QParserPlugin (not sure if that's possible), or you have to find out what's wrong with the file. Where did you get it, and have you talked to the author? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Ahson Iqbal To: Solr Send Mail Sent: Thu, January 20, 2011 11:24:37 PM Subject: Integrating Surround Query Parser Hi All I want to integrate the Surround Query Parser with solr. To do this I downloaded a jar file from the internet, pasted that jar file in web-inf/lib, and configured the query parser in solrconfig.xml. Now when I load the solr admin page the following exception comes: org.apache.solr.common.SolrException: Error Instantiating QParserPlugin, org.apache.lucene.queryParser.surround.parser.QueryParser is not a org.apache.solr.search.QParserPlugin What I think is that I didn't get the right plugin. Can anybody guide me on where to get the right plugin for the surround query parser, or how to accurately integrate this plugin with solr? thanx Ahsan
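For reference, the exception just means the class registered in solrconfig.xml must extend org.apache.solr.search.QParserPlugin — the raw Lucene surround QueryParser is a query parser, not a Solr plugin, so a thin wrapper class has to be registered instead. A hedged sketch of the registration (the wrapper class name is hypothetical):

  <!-- solrconfig.xml -->
  <queryParser name="surround" class="com.example.SurroundQParserPlugin"/>

after which queries would select it with q={!surround}...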
Re: pruning search result with search score gradient
that's a pretty good idea, using 'delta score' Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Toke Eskildsen To: "solr-user@lucene.apache.org" Sent: Thu, January 20, 2011 11:31:48 PM Subject: Re: pruning search result with search score gradient On Tue, 2011-01-11 at 12:12 +0100, Julien Piquot wrote: > I would like to be able to prune my search result by removing the less > relevant documents. I'm thinking about using the search score: I use > the search scores of the document set (I assume they are sorted in > descending order), normalise them (0 would be the lowest value and 1 > the greatest value) and then calculate the gradient of the normalised > scores. The documents with a gradient below a threshold value would be > rejected. As part of experimenting with federated search, this is one approach we'll be trying out to determine which results to discard when merging. > If the scores are linearly decreasing, then no document is rejected. > However, if there is a brutal score drop, then the documents below the > drop are rejected. So if we have the scores 1.0, 0.9, 0.2, 0.15, 0.1, 0.05 then the slopes will be 0.1, 0.7, 0.05, 0.05, 0.05 and with a slope threshold of 0.5, we would discard everything from score 0.2 and below. It makes sense if the scores are linear with the relevance (a document with score 0.8 has double the relevance of one with 0.4). I don't know if they are, so experiments must be made and I fear that this is another demonstration of the inherent problem with quantifying quality. - Toke
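A rough PHP sketch of the cut-off Toke describes, using raw consecutive score drops (min-max normalisation, as Julien suggests, could be applied first; this is not code from the thread):

  // Keep hits until the drop between consecutive scores exceeds $threshold.
  function pruneByScoreDrop(array $scores, $threshold) {
      $keep = array($scores[0]);
      for ($i = 1, $n = count($scores); $i < $n; $i++) {
          if ($scores[$i - 1] - $scores[$i] > $threshold) {
              break; // brutal score drop: discard this hit and everything after it
          }
          $keep[] = $scores[$i];
      }
      return $keep;
  }

  // pruneByScoreDrop(array(1.0, 0.9, 0.2, 0.15, 0.1, 0.05), 0.5) => array(1.0, 0.9)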
Re: Document level security
Would you do that with 1000's of users? How expensive in processor time is it? Have you ever benchmarked it? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Grijesh To: solr-user@lucene.apache.org Sent: Thu, January 20, 2011 11:05:33 PM Subject: Re: Document level security Hi Rok, I have used about 25 ids with the OR operator and it's working fine for me. Just have to increase the maxBooleanClauses parameter, and also have to configure the max header size on the servlet container to enable big query requests. - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Document-level-security-tp2298066p2300117.html Sent from the Solr - User mailing list archive at Nabble.com.
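The two knobs Grijesh mentions, for reference — the boolean-clause limit lives in solrconfig.xml (the default is 1024), and the header size is container-specific (the Jetty snippet below is illustrative; values are examples, not recommendations):

  <!-- solrconfig.xml -->
  <maxBooleanClauses>4096</maxBooleanClauses>

  <!-- jetty.xml, on the connector -->
  <Set name="headerBufferSize">65536</Set>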
Re: Document level security
I'm thinking of using something like this: http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/ http://www.xaprb.com/blog/2006/08/18/role-based-access-control-in-sql-part-2/ - Original Message From: Dennis Gearon To: solr-user@lucene.apache.org Sent: Thu, January 20, 2011 8:21:02 PM Subject: Re: Document level security I'm not sure how you COULD do searching without having the permissions in the documents. I mentally use the model of unix filesystems, as a starter. Simple, but powerful. If I needed a separate table for permissions, or index, I'd have to do queries with GINORMOUS amounts of OR statements. I see it flowing like: User U Has Access to Documents DS (40,000,000 out of 100,000,000 of them), Now get these (list of 40x10^6) documents. How do you see it Peter? Dennis Gearon - Original Message From: Peter Sturge To: solr-user@lucene.apache.org Sent: Thu, January 20, 2011 3:16:59 PM Subject: Re: Document level security Hi, One of the things about Document Security is that it never involves just one thing. There are a lot of things to consider, and unfortunately, they're generally non-trivial. Deciding how to store/hold/retrieve permissions is certainly one of those things, and you're right, you should avoid attaching permissions to document data in the index, because if you want to change permissions (and you will want to change them at some point), it can be a cumbersome job, particularly if it involves millions of documents, replication, shards etc. It's also generally a good idea not to tie your schema to permission fields. Another big consideration is authentication - how can you be sure the request is coming from the user you think it is? Is there a certificate involved? Has the user authenticated to the container? If so, how do you get to this? and so on... For permissions storage, there are two realistic approaches to consider: 1. Write a SearchComponent that handles permission requests. This typically involves storing/reading permissions in/from a file, database or separate index (see SOLR-1872) 2. Use an LCF module to retrieve permissions from the original documents themselves (see SOLR-1834) Hope this helps, Peter On Thu, Jan 20, 2011 at 8:44 PM, Rok Rejc wrote: > Hi all, > > I have an index containing a couple of million documents. > Documents are grouped into "groups", each group contains from 1000-2 > documents. > > The problem: > Each group has defined permission settings. It can be viewed by public, > viewed by registered users, or viewed by a list of users (each group has its > own list of users). > Said differently: I need document security. > > What I read from the other threads is that it is not recommended to store > permissions in the index. I have already all the permissions in the > database, but I don't "know" how to connect the database and the index. > I can query the database to get the groups in which the user is and after > that do the OR query, but I am afraid that this list can be too big (100 > OR's could also exceed maximum HTTP GET query string length). > > What are the other options? Should I write a custom collector which will > query (and cache) the database for permissions? > > Any ideas are appreciated... > > Many thanks, Rok >
Re: Document level security
I'm not sure how you COULD do searching without having the permissions in the documents. I mentally use the model of unix filesystems, as a starter. Simple, but powerful. If I needed a separate table for permissions, or index, I'd have to do queries with GINORMOUS amounts of OR statements. I see it flowing like: User U Has Access to Documents DS (40,000,000 out of 100,000,000 of them), Now get these (list of 40x10^6) documents. How do you see it Peter? Dennis Gearon - Original Message From: Peter Sturge To: solr-user@lucene.apache.org Sent: Thu, January 20, 2011 3:16:59 PM Subject: Re: Document level security Hi, One of the things about Document Security is that it never involves just one thing. There are a lot of things to consider, and unfortunately, they're generally non-trivial. Deciding how to store/hold/retrieve permissions is certainly one of those things, and you're right, you should avoid attaching permissions to document data in the index, because if you want to change permissions (and you will want to change them at some point), it can be a cumbersome job, particularly if it involves millions of documents, replication, shards etc. It's also generally a good idea not to tie your schema to permission fields. Another big consideration is authentication - how can you be sure the request is coming from the user you think it is? Is there a certificate involved? Has the user authenticated to the container? If so, how do you get to this? and so on... For permissions storage, there are two realistic approaches to consider: 1. Write a SearchComponent that handles permission requests. This typically involves storing/reading permissions in/from a file, database or separate index (see SOLR-1872) 2. Use an LCF module to retrieve permissions from the original documents themselves (see SOLR-1834) Hope this helps, Peter On Thu, Jan 20, 2011 at 8:44 PM, Rok Rejc wrote: > Hi all, > > I have an index containing a couple of million documents. > Documents are grouped into "groups", each group contains from 1000-2 > documents. > > The problem: > Each group has defined permission settings. It can be viewed by public, > viewed by registered users, or viewed by a list of users (each group has its > own list of users). > Said differently: I need document security. > > What I read from the other threads is that it is not recommended to store > permissions in the index. I have already all the permissions in the > database, but I don't "know" how to connect the database and the index. > I can query the database to get the groups in which the user is and after > that do the OR query, but I am afraid that this list can be too big (100 > OR's could also exceed maximum HTTP GET query string length). > > What are the other options? Should I write a custom collector which will > query (and cache) the database for permissions? > > Any ideas are appreciated... > > Many thanks, Rok >
Re: unix permission styles for access control
Three-dimensional multi value sounds good. Tough choice on character vs full-length words. Full length is easier & less confusing, but with hopefully millions of documents in the future, it increases index size. Sent from Yahoo! Mail on Android
Documentation: For newbies and recent newbies
If someone is looking for good documentation and getting started guides, I am putting this in the newsgroups to be searched upon. I recommend: A/ The Wikis: (FREE) http://wiki.apache.org/solr/FrontPage B/ The book and eBook: (COSTS $45.89) https://www.packtpub.com/solr-1-4-enterprise-search-server/book C/ The (seemingly) total reference guide:(FREE, with registration) http://www.lucidimagination.com/software_downloads/certified/cdrg/lucidworks-solr-refguide-1.4.pdf D/ The webinar on optimizing the search engine to Do a GOOD search, based on YOUR needs, not general ones: (FREE, with registration) http://www.lucidimagination.com/Solutions/Webinars/Analyze-This-Tips-and-tricks-getting-LuceneSolr-Analyzer-index-and-search-your-content Personally, I am working on being more than barely informed on items A & B :-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: unix permission styles for access control
So, if I used something like r-u-d-o in a field (read, update, delete, others) I could get it tokenized to those four characters, and then search for those in that field. Is that what you're suggesting? (Thanks, by the way.) An article I read created a 'hybrid' access control system (can't remember if it was ACL or RBAC). It used a primary system like the Unix file system's 9-bit permissions for the primary permissions normally needed on most objects of any kind, and then flagged if there were any other permissions and any other groups. It was very fast for the primary permissions, and fast for the secondary. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Sent: Wed, January 19, 2011 8:40:30 AM Subject: Re: unix permission styles for access control No. There is no built in way to address 'bits' in Solr that I am aware of. Instead you can think about how to transform your data at indexing into individual tokens (rather than bits) in one or more field, such that they are capable of answering your query. Solr works in tokens as the basic unit of operation (mostly, basically), not characters or bytes or bits. On 1/19/2011 9:48 AM, Dennis Gearon wrote: > Sorry for repeat, trying to make sure this gets on the newsgroup to 'all'. > > So 'fieldName.x' is how to address bits? > > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a >better > idea to learn from others’ mistakes, so you do not have to make them yourself. > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. > > > > - Original Message > From: Toke Eskildsen > To: "solr-user@lucene.apache.org" > Sent: Wed, January 19, 2011 12:23:04 AM > Subject: Re: unix permission styles for access control > > On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote: >> I was wondering if there are binary operation filters? Haven't seen any in the >> book nor was I able to find any using google. >> >> So if I had 0600(octal) in a permission field, and I wanted to return any >> records that 'permission & 0400(octal)==TRUE', how would I filter that? > Don't you mean permission & 0400(octal) == 0400? Anyway, the > functionality can be accomplished by extending your index a bit. > > > You could split the permission into user, group and all parts, then use > an expanded query. > > If the permission is 0755 it will be indexed as > user_p:7 group_p:5 all_p:5 > > If you're searching for something with at least 0650 your query should > be expanded to > (user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5) > > > Alternatively you could represent the bits explicitly in the index: > user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4 > > Then a search for 0650 would query with > user_p:2 AND user_p:4 AND group_p:1 AND group_p:4 > > > Finally you could represent all valid permission values, still split > into parts with > user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7 > group_p:1 group_p:2 group_p:3 group_p:4 group_p:5 > all_p:1 all_p:2 all_p:3 all_p:4 all_p:5 > > The query would be simply > user_p:6 AND group_p:5
Re: unix permission styles for access control
Did some more searching this morning. Perhaps being bleary eyed helped :-) I found this JIRA which does bitwise boolean operator filtering: https://issues.apache.org/jira/browse/SOLR-1913 I'm not that sure how to interpret JIRA pages for features. It's 'OPEN', but the comments all say it works. So, what's the syntax for combining filters in queries? I am currently using the spatial filter. How would I write a query that combines: http://localhost:8983/path/to/solr/select/?q={!bitwise field=fieldname op=OPERATION_NAME source=sourcevalue negate=boolean}remainder {!spatial lat=37.393026 long=-121.998304 radius=10 unit=km threadCount=3} ts_begin:[1 TO 2145916800] AND text:"find_this" Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Toke Eskildsen To: "solr-user@lucene.apache.org" Sent: Wed, January 19, 2011 12:23:04 AM Subject: Re: unix permission styles for access control On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote: > I was wondering if there are binary operation filters? Haven't seen any in the > book nor was I able to find any using google. > > So if I had 0600(octal) in a permission field, and I wanted to return any > records that 'permission & 0400(octal)==TRUE', how would I filter that? Don't you mean permission & 0400(octal) == 0400? Anyway, the functionality can be accomplished by extending your index a bit. You could split the permission into user, group and all parts, then use an expanded query. If the permission is 0755 it will be indexed as user_p:7 group_p:5 all_p:5 If you're searching for something with at least 0650 your query should be expanded to (user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5) Alternatively you could represent the bits explicitly in the index: user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4 Then a search for 0650 would query with user_p:2 AND user_p:4 AND group_p:1 AND group_p:4 Finally you could represent all valid permission values, still split into parts with user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7 group_p:1 group_p:2 group_p:3 group_p:4 group_p:5 all_p:1 all_p:2 all_p:3 all_p:4 all_p:5 The query would be simply user_p:6 AND group_p:5
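On the syntax question: each local-params filter simply goes in its own fq parameter, so a combination along these lines should work (the bitwise parameter values are placeholders copied from the JIRA description above; untested):

  http://localhost:8983/solr/select?q=ts_begin:[1 TO 2145916800] AND text:"find_this"
    &fq={!bitwise field=permissions op=AND source=256}
    &fq={!spatial lat=37.393026 long=-121.998304 radius=10 unit=km threadCount=3}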
Re: unix permission styles for access control
Sorry for repeat, trying to make sure this gets on the newsgroup to 'all'. So 'fieldName.x' is how to address bits? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Toke Eskildsen To: "solr-user@lucene.apache.org" Sent: Wed, January 19, 2011 12:23:04 AM Subject: Re: unix permission styles for access control On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote: > I was wondering if there are binary operation filters? Haven't seen any in the > book nor was I able to find any using google. > > So if I had 0600(octal) in a permission field, and I wanted to return any > records that 'permission & 0400(octal)==TRUE', how would I filter that? Don't you mean permission & 0400(octal) == 0400? Anyway, the functionality can be accomplished by extending your index a bit. You could split the permission into user, group and all parts, then use an expanded query. If the permission is 0755 it will be indexed as user_p:7 group_p:5 all_p:5 If you're searching for something with at least 0650 your query should be expanded to (user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5) Alternatively you could represent the bits explicitly in the index: user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4 Then a search for 0650 would query with user_p:2 AND user_p:4 AND group_p:1 AND group_p:4 Finally you could represent all valid permission values, still split into parts with user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7 group_p:1 group_p:2 group_p:3 group_p:4 group_p:5 all_p:1 all_p:2 all_p:3 all_p:4 all_p:5 The query would be simply user_p:6 AND group_p:5
Re: unix permission styles for access control
so fieldName.x is how to address bits? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Toke Eskildsen To: "solr-user@lucene.apache.org" Sent: Wed, January 19, 2011 12:23:04 AM Subject: Re: unix permission styles for access control On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote: > I was wondering if there are binary operation filters? Haven't seen any in the > book nor was I able to find any using google. > > So if I had 0600(octal) in a permission field, and I wanted to return any > records that 'permission & 0400(octal)==TRUE', how would I filter that? Don't you mean permission & 0400(octal) == 0400? Anyway, the functionality can be accomplished by extending your index a bit. You could split the permission into user, group and all parts, then use an expanded query. If the permission is 0755 it will be indexed as user_p:7 group_p:5 all_p:5 If you're searching for something with at least 0650 your query should be expanded to (user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5) Alternatively you could represent the bits explicitly in the index: user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4 Then a search for 0650 would query with user_p:2 AND user_p:4 AND group_p:1 AND group_p:4 Finally you could represent all valid permission values, still split into parts with user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7 group_p:1 group_p:2 group_p:3 group_p:4 group_p:5 all_p:1 all_p:2 all_p:3 all_p:4 all_p:5 The query would be simply user_p:6 AND group_p:5
unix permission styles for access control
I was wondering if there are binary operation filters? Haven't seen any in the book, nor was I able to find any using google. So if I had 0600(octal) in a permission field, and I wanted to return any records that 'permission & 0400(octal)==TRUE', how would I filter that? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Indexing and Searching Chinese with SolrNet
Make sure your browser is set to UTF-8 encoding. - Original Message From: Otis Gospodnetic To: solr-user@lucene.apache.org; bing...@asu.edu Sent: Tue, January 18, 2011 10:39:16 AM Subject: Re: Indexing and Searching Chinese with SolrNet Bing Li, Go to your Solr Admin page and use the Analysis functionality there to enter some Chinese text and see how it's getting analyzed at index and at search time. This will tell you what is (or isn't) going on. Here it looks like you just defined index-time analysis, so you should see your index-time analysis look very different from your query-time analysis. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Bing Li > To: solr-user@lucene.apache.org > Sent: Tue, January 18, 2011 1:30:37 PM > Subject: Indexing and Searching Chinese with SolrNet > > Dear all, > > After reading some pages on the Web, I created the index with the following > schema. > > .. > positionIncrementGap="100"> > >class="solr.ChineseTokenizerFactory"/> > > > .. > > It must be correct, right? However, when sending a query through SolrNet, no > results are returned. Could you tell me what the reason is? > > Thanks, > LB >
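The schema fragment above was stripped by the mail archiver; from the surviving attributes it presumably looked something like this (the fieldType name is assumed):

  <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ChineseTokenizerFactory"/>
    </analyzer>
  </fieldType>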
Re: Solr UUID field for externally generated UUIDs
THX, Chris! Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Chris Hostetter To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 11:35:38 AM Subject: Re: Solr UUID field for externally generated UUIDs : : : The above won't generate a UUID on its own, right? correct. -Hoss
Solr UUID field for externally generated UUIDs
I would like to use the following field declaration to store my own COMB UUIDs (same length and format, a kind of cross between version 1 and version 4). If I leave out the default value in the declaration, would that work? I.E.: The above won't generate a UUID on its own, right? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
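The declaration itself was stripped by the archiver; it was presumably along these lines — with default="NEW" Solr generates the UUID itself, while leaving the default out (as below) means the client must always supply the value, which is the behaviour being asked about:

  <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
  <field name="id" type="uuid" indexed="true" stored="true" required="true"/>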
Re: Does Solr supports indexing & search for Hebrew.
Whoops, picked the wrong email to reply thanks to. Wasn't actually in this thread. Dennis Gearon - Original Message From: Dennis Gearon To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 8:25:04 AM Subject: Re: Does Solr supports indexing & search for Hebrew. Thanks Ofer :-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Ofer Fort To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 4:55:53 AM Subject: Re: Does Solr supports indexing & search for Hebrew. take a look at : http://github.com/synhershko/HebMorph with more info at http://www.code972.com/blog/hebmorph/ On Tue, Jan 18, 2011 at 11:04 AM, prasad deshpande < prasad.deshpand...@gmail.com> wrote: > Hello, > > With reference to below links I haven't found Hebrew support in Solr. > > http://wiki.apache.org/solr/LanguageAnalysis > > http://lucene.apache.org/java/3_0_3/api/all/index.html > > If I want to index and search Hebrew files/data then how would I achieve > this? > > Thanks, > Prasad >
Re: Does Solr supports indexing & search for Hebrew.
Thanks Ofer :-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Ofer Fort To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 4:55:53 AM Subject: Re: Does Solr supports indexing & search for Hebrew. take a look at : http://github.com/synhershko/HebMorph with more info at http://www.code972.com/blog/hebmorph/ On Tue, Jan 18, 2011 at 11:04 AM, prasad deshpande < prasad.deshpand...@gmail.com> wrote: > Hello, > > With reference to below links I haven't found Hebrew support in Solr. > > http://wiki.apache.org/solr/LanguageAnalysis > > http://lucene.apache.org/java/3_0_3/api/all/index.html > > If I want to index and search Hebrew files/data then how would I achieve > this? > > Thanks, > Prasad >
Re: just got 'the book' already have a question
Thanks Robert. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Robert Muir To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 1:40:25 AM Subject: Re: just got 'the book' already have a question On Mon, Jan 17, 2011 at 11:10 PM, Dennis Gearon wrote: > First of all, seems like a good book, > > Solr-14-Enterprise-Search-Server.pdf > > Question, is it possible to choose locale at search time? So if my customer is > querying across cultural/national/linguistic boundaries and I have the data for > him in different languages in the same index, can I sort based on his language? > http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_for_multiple_languages
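The wiki page Robert links boils down to one collated sort field per locale — a hedged sketch (field and attribute names assumed, per that page's collation support):

  <fieldType name="sort_fr" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.CollationKeyFilterFactory" language="fr" strength="primary"/>
    </analyzer>
  </fieldType>

so a French user's request would add &sort=title_fr asc while an English user's would use a title_en twin.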
Re: NRT
Thanks Otis Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Otis Gospodnetic To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 11:15:23 PM Subject: Re: NRT Hi, > How is NRT doing, being used in production? > Which Solr is it in? Unless I missed it, I don't think there is true NRT in Solr just yet. > And is there built in Spatial in that version? > > How is Solr 4.x doing? Well :) 3 ways to know this sort of stuff: * follow the dev list - high volume * subscribe to Sematext Blog - we publish monthly Solr Digests * check JIRA to see how many issues remain to be fixed Otis -- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
explicit field type descriptions
Is there any tabular data anywhere on ALL field types and ALL options? For example, I've looked everywhere in the last hour, and I don't see anywhere on Solr site, google, or in the 1.4 manual where it says whether a copyField 'directive' can be made ' required="true" '. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
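For what it's worth, required is an attribute of <field>, not of <copyField> — a copyField only takes source and dest (and later maxChars), so the constraint has to sit on the field itself:

  <field name="title" type="text" indexed="true" stored="true" required="true"/>
  <copyField source="title" dest="all_text"/>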
just got 'the book' already have a question
First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf Question, is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
NRT
How is NRT doing, being used in production? Which Solr is it in? And is there built in Spatial in that version? How is Solr 4.x doing? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: use of schema.xml
I could put 1-10,000 fields in any one document, as long as they are told what type or they are dynamically matched by dynamic fields relative to what's in the schema.xml file? It's very much like google 'big tables' or 'elastic search' that way, right? It's up to me to enforce any field names or quantities and assign field types during insert/update? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Lance Norskog To: solr-user@lucene.apache.org Sent: Thu, January 13, 2011 8:16:54 PM Subject: Re: use of schema.xml Wait- it does enforce the schema names. What it does not enforce is field contents when you change the schema. Since Lucene does not have field replacement, it is not practical to remove or add a field to all existing documents when you change the schema. On Thu, Jan 13, 2011 at 8:15 PM, Lance Norskog wrote: > Correct. Solr and Lucene do not store or enforce the schema. You're on > your own :) > > On Thu, Jan 13, 2011 at 8:09 PM, Dennis Gearon wrote: >> I'm going to buy the book for Solr, since it looks like I need to do more of >>the >> work than I thought I would. >> >> But, from looking at it, the schema file only says: >> >> A/ What types of data can be in the 'fields' of the documents >> B/ If there are any dynamically assigned fields. >> C/ What parsers are available >> D/ other stuff. >> >> And what it DOESN'T do is set the 'schema' for the index, right? >> (like DDL for a database does) >> >> Dennis Gearon >> >> >> Signature Warning >> >> It is always a good idea to learn from your own mistakes. It is usually a >>better >> idea to learn from others’ mistakes, so you do not have to make them yourself. >> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' >> >> >> EARTH has a Right To Life, >> otherwise we all die. >> >> > > > > -- > Lance Norskog > goks...@gmail.com > -- Lance Norskog goks...@gmail.com
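A hedged example of the dynamic-field matching being described (the suffix conventions are the stock schema's, assumed here):

  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_i" type="int" indexed="true" stored="true"/>

A document can then carry color_s, size_i, or any other matching name without it ever being declared explicitly — enforcing which names actually appear is left to the indexing client.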
use of schema.xml
I'm going to buy the book for Solr, since it looks like I need to do more of the work than I thought I would. But, from looking at it, the schema file only says: A/ What types of data can be in the 'fields' of the documents B/ If there are any dynamically assigned fields. C/ What parsers are available D/ other stuff. And what it DOESN'T do is set the 'schema' for the index, right? (like DDL for a database does) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: start value in queries zero or one based?
I'm migrating to CTO/CEO status in life due to building a small company. I find I don't have too much time for theory. I work with what is. So, what is it, not what should it be. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Walter Underwood To: solr-user@lucene.apache.org Sent: Thu, January 13, 2011 1:38:26 PM Subject: Re: start value in queries zero or one based? On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote: > Do I even need a body for this message? ;-) > > Dennis Gearon Are you asking "is it" or "should it be"? If the latter, we can also discuss Emacs and vi. wunder -- Walter Underwood K6WRU
start value in queries zero or one based?
Do I even need a body for this message? ;-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
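(For the record, Solr's start parameter is zero-based — start=0 is the first hit:

  http://localhost:8983/solr/select?q=*:*&start=0&rows=10    <- hits 1-10
  http://localhost:8983/solr/select?q=*:*&start=10&rows=10   <- hits 11-20)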
basic document crud in an index
OK, getting ready to be more interactive with my index, (she likes me). These are pretty much boolean answered questions to help my understanding. I think having these in the mail list records might help others too. A/ Is there a query that updates all the fields automatically on a record that has a unique id? B/ Does it leave the old document and new document in the index? C/ Will a query immediately following see both documents? D/ Merging does not get rid of any old documents if there are any, but optimize does? E/ Is optimize invoked on the whole index, not individual segments? Thanks for a great product, y'all. I have a 64K document index, small by many standards. But I did a search on it for a test, and started at row 16,000 of the results (broad results), and it was almost not noticeably slower than starting at 0. And it's on the lowest cost Amazon server that will run it. Of course, no one but me is hitting that box yet :-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
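On A/ and B/: re-adding a document whose uniqueKey already exists overwrites it — there is no partial update, so every field must be resent, and the old version merely sits flagged as deleted until a merge or optimize reclaims it. A minimal sketch (field names assumed):

  <add>
    <doc>
      <field name="id">doc-42</field>
      <field name="title">the replacement version, with all fields included</field>
    </doc>
  </add>

POSTed to /solr/update, followed by <commit/> before the change becomes visible to searchers.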
Re: Exciting Solr Use Cases
When I have it running with a permission system (through both API and front end), I will share it with everyone. It's beginning to happen. The search is fairly primitive for now. But we hope to learn or hire skills to better match it to the business model as we grow/get funding. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Peter Karich To: solr-user@lucene.apache.org Sent: Wed, January 12, 2011 3:37:12 PM Subject: Exciting Solr Use Cases Hi all! Would you mind writing about your Solr project if it has an uncommon approach or if it is somehow exciting? I would like to extend my list for a new blog post. Examples I have in mind at the moment are: loggly (real time + big index), solandra (nice solr + cassandra combination), HathiTrust (extreme index size), ... Kind Regards, Peter.
Re: PHP app not communicating with Solr
I was unable to get it to compile. From the author, got one reply about the benefits of the compiled version. After submitting my errors to him, have not yet received a reply. ##Weird thing 'on the way to the forum' today.## I remember reading an article a couple of days ago which said the compiled version is 10-15% faster than the 'pure PHP' Solr library out there, (and it has a lot more capability, that's for sure!) Turns out, this slower pure PHP version uses 'file_get_contents()' (FGC) to do the actual query of the Solr Instance. http://stackoverflow.com/questions/23/file-get-contents-vs-curl-what-has-better-performance The article above shows that FGC is on average 22% slower than using cURL in basic usage. So modifying the 'pure PHP' library with cURL would make up for all of the speed that the compiled SolrPHP has. Dennis Gearon - Original Message From: Lukas Kahwe Smith To: solr-user@lucene.apache.org Sent: Wed, January 12, 2011 2:52:46 PM Subject: Re: PHP app not communicating with Solr On 12.01.2011, at 23:50, Eric wrote: > Web page returns the following message: > Fatal error: Uncaught exception 'Exception' with message '"0" Status: >Communication Error' > > This happens in a dev environment, everything on one machine: Windows 7, > WAMP, >CakePHP, Tomcat, Solr, and SolrPHPClient. Error message also references line >334 >of the Service.php file, which is part of the SolrPHPClient. > > Everything works perfectly on a different machine so this problem is probably >related to configuration. On the problem machine, I can reach solr at >http://localhost:8080/solr/admin and it looks correct (AFAIK). I am >documenting >the setup procedures this time around but don't know what's different between >the two machines. > > Google search on the error message shows the message is not uncommon so the >answer might be helpful to others as well. I ran into this issue compiling PHP with --curl-wrappers. regards, Lukas Kahwe Smith m...@pooteeweet.org
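A rough sketch of the swap being suggested — replacing file_get_contents() with cURL in such a client (URL and error handling simplified; not code from either library):

  function solrGet($url) {
      $ch = curl_init($url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it
      $body = curl_exec($ch);
      curl_close($ch);
      return $body;
  }

  $json = solrGet('http://localhost:8983/solr/select?q=*:*&wt=json');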
Re: Solr trunk for production
What's the syntax for spatial for that version of Solr? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Ron Mayer To: solr-user@lucene.apache.org Sent: Wed, January 12, 2011 7:18:10 AM Subject: Re: Solr trunk for production Otis Gospodnetic wrote: > Are people using Solr trunk in serious production environments? I suspect > the > answer is yes, just want to see if there are any gotchas/warnings. Yes, since it seemed the best way to get edismax with this patch[1]; and to get the more update-friendly MergePolicy[2]. Main gotcha I noticed so far is trying to figure out appropriate times to sync with trunk's newer patches; and whether or not we need to rebuild our kinda big (> 1TB) indexes when we do. [1] the patch I needed: https://issues.apache.org/jira/browse/SOLR-2058 [2] nicer MergePolicy https://issues.apache.org/jira/browse/LUCENE-2602
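For trunk at that time, the built-in spatial filters looked roughly like this (field name assumed; these are the geofilt/bbox local params):

  fq={!geofilt sfield=store pt=45.15,-93.85 d=5}   <- great-circle distance filter
  fq={!bbox sfield=store pt=45.15,-93.85 d=5}      <- cheaper bounding-box approximation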
Re: issue with the spatial search with solr
You didn't happen to notice that you have one field named RestaurantLocation and another named RestaurantName, did you? You must be submitting 'restaurantName' as the spatial field, so the geo filter is being applied to a non-geo field. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: ur lops To: solr-user@lucene.apache.org Sent: Tue, January 11, 2011 11:13:36 PM Subject: issue with the spatial search with solr Hi, I took the latest build from the hudson and installed on my computer. I have done the following changes in my schema.xml When I run the query like this: HTTP ERROR 500 Problem accessing /solr/select. Reason: The field restaurantName does not support spatial filtering org.apache.solr.common.SolrException: The field restaurantName does not support spatial filtering at org.apache.solr.search.SpatialFilterQParser.parse(SpatialFilterQParser.java:86) at org.apache.solr.search.QParser.getQuery(QParser.java:143) at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:112) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:210) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1296) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) This is my solr query: select?wt=json&indent=true&fl=name,store&q=*:*&fq={!geofilt%20sfield=restaurantName}&pt=45.15,-93.85&d=5 Any help will be highly appreciated. Thanks
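Concretely, pointing sfield at the location field instead of the name field should clear the error (assuming restaurantLocation is the geo-typed field in the stripped schema above):

  select?wt=json&indent=true&fl=name,store&q=*:*&fq={!geofilt sfield=restaurantLocation pt=45.15,-93.85 d=5}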
Re: Input raw log file
A possible shortcut? Write a regex that will parse out the fields as you want them, put that into some shell script that calls Solr? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Grijesh.singh To: solr-user@lucene.apache.org Sent: Tue, January 11, 2011 10:46:20 PM Subject: Re: Input raw log file First thing: Solr cannot understand your raw log files as they are. Solr needs data that matches the defined schema, and Solr does not know your log file format. So you have to write a parser program that will parse your log files into an existing Solr-writable format. Then you will be able to index that data. - Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Input-raw-log-file-tp2210043p2239548.html Sent from the Solr - User mailing list archive at Nabble.com.
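As a minimal sketch of that shortcut — assuming an Apache-style access log and a schema that defines id, ip, ts, and url fields (all of those names are invented here, and ts is kept as a plain string to dodge date conversion) — a few lines of PHP can do both the regex pass and the post to Solr's XML update handler:

<?php
// Sketch only: parse an Apache-ish access log with a regex and post
// the rows to Solr's XML update handler. Field names must match
// whatever your schema actually defines.
$docs = '<add>';
foreach (file('access.log') as $i => $line) {
    // e.g.: 1.2.3.4 - - [10/Jan/2011:13:55:36 -0700] "GET /path HTTP/1.1" 200 2326
    if (preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)/', $line, $m)) {
        $docs .= '<doc>'
               . '<field name="id">log-' . $i . '</field>'
               . '<field name="ip">' . htmlspecialchars($m[1]) . '</field>'
               . '<field name="ts">' . htmlspecialchars($m[2]) . '</field>'
               . '<field name="url">' . htmlspecialchars($m[3]) . '</field>'
               . '</doc>';
    }
}
$docs .= '</add>';
$ch = curl_init('http://localhost:8983/solr/update?commit=true');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $docs);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_exec($ch);
curl_close($ch);

Real date fields would still need the timestamp converted to Solr's ISO 8601 format before indexing.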
Re: Multiple Solr instances common core possible ?
NOT sure about any of it, but I THINK that read-only instances, with one Solr instance doing the writes, are possible. I've heard that it's NEVER possible to have multiple Solr instances writing to the same index. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Ravi Kiran To: solr-user@lucene.apache.org Sent: Tue, January 11, 2011 9:15:06 AM Subject: Multiple Solr instances common core possible ? Hello, Is it possible to deploy multiple Solr instances with different context roots pointing to the same Solr core? If I do this, will there be any deadlocks or file handle issues? The reason I need this setup is that I want to expose Solr to a third-party vendor via a different context root. My Solr instance is deployed on Glassfish. Alternately, if there is a configurable way to set up multiple context roots for the same Solr instance, that will suffice at this point in time. Ravi Kiran
How to insert this using Solr PHP?
I am switching between building the query to a Solr instance by hand and doing it with the PHP Solr extension. I have this query that my dev partner said to insert before all the other column searches. What kind of query is it, and how do I get it into the query in an 'OOP' style using the PHP Solr extension? In particular, I'm interested in what the 'q={!...}' part of the query is. Is that a filter query? How do I put it into the query . . . I already asked that ;-) URL_BASE?wt=json&indent=true&start=0&rows=20&q={!spatial lat=xx.x long=xxx.x radius=10 unit=km threadCount=3} OTHER COLUMNS, blah blah bcc: my partner Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
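For what it's worth, the {!spatial ...} prefix is LocalParams syntax — it selects and configures a query parser for the q parameter — so it is not a filter query unless it is put in fq. With the PECL Solr extension the whole string just goes into setQuery(); a sketch (host, port, path, and the lat/long numbers are placeholders):

<?php
// Sketch: pass the {!spatial ...} LocalParams prefix straight through
// as part of the q string; connection details here are placeholders.
$client = new SolrClient(array('hostname' => 'localhost', 'port' => 8983, 'path' => '/solr'));
$query = new SolrQuery();
$query->setQuery('{!spatial lat=45.15 long=-93.85 radius=10 unit=km threadCount=3} OTHER COLUMNS, blah blah');
$query->setStart(0);
$query->setRows(20);
$response = $client->query($query);
print_r($response->getResponse());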
icq or other 'instant gratification' communication forums for Solr
Are there any chatrooms or ICQ rooms for asking questions late at night of people who stay up, or who are on the other side of the planet? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Improving Solr performance
What I seem to see suggested here is to use different cores for the things you suggested: different types of documents, Access Control Lists. I wonder how sharding would work in that scenario? Me, I plan on: For security: using a permissions field. For different schemas: dynamic fields, with enough premade fields to handle it. The one thing I don't think my approach does well with is statistics. Dennis Gearon - Original Message From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Cc: supersoft Sent: Mon, January 10, 2011 1:08:00 PM Subject: Re: Improving Solr performance I see a lot of people using shards to hold "different types of documents", and it almost always seems to be a bad solution. Shards are intended for distributing a large index over multiple hosts -- that's it. Not for some kind of federated search over multiple schemas, not for access control. Why not put everything in the same index, without shards, and just use an 'fq' limit in order to limit to the specific documents you'd like to search over in a given search? I think that would achieve your goal a lot more simply than shards -- then you use sharding only if and when your index grows to be so large you'd like to distribute it over multiple hosts, and when you do so you choose a shard key that will have more or less equal distribution across shards. Using shards for access control or schema management just leads to headaches. [Apparently Solr could use some highlighted documentation on what shards are really for, as it seems to be a very common issue on this list, someone trying to use them for something else and then inevitably finding problems with that approach.] Jonathan On 1/7/2011 6:48 AM, supersoft wrote: > The reason for this distribution is the kind of the documents. In spite of > having the same schema structure (and solr conf), a document belongs to 1 of > 5 different kinds. > > Each kind corresponds to a concrete shard and due to this, the implemented > client tool avoids searching in all the shards when the user selects just > one or a few of the kinds. The tool runs a multisharded query of the proper > shards. I guess this is a right approach but correct me if I am wrong. > > The real problem of this architecture is the correlation between concurrent > users and response time: > 1 query: n seconds > 2 queries: 2*n seconds each query > 3 queries: 3*n seconds each query > and so... > > This is being a real headache because 1 single query has an acceptable > response time but when many users are accessing the server the > performance drops off badly.
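To make Jonathan's fq suggestion concrete: if every document carried, say, a doc_type field (the name is illustrative), one unsharded index serves all the kinds — q=user+terms&fq=doc_type:kind1 for one kind, or q=user+terms&fq=doc_type:(kind1 OR kind3) for a few of them — and since each distinct fq clause is cached in the filterCache, the per-kind filters are cheap after the first use.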
Re: Box occasionally pegs one cpu at 100%
One other possibility is that the OS or BIOS is doing that, at least on a laptop. There is a new feature where, if the load is low enough, non-multi-threaded applications can be assigned to one processor and that processor has its clock boosted so the older software will run faster on the new processors - otherwise they run SLOWER! My brother has a CAD program that runs slower on his new quad core because the base clock speed is slower than a single-processor CPU. The software company is not taking the time to rewrite their code, except where they add features or fixes. - Original Message From: Brian Burke To: "solr-user@lucene.apache.org" Sent: Mon, January 10, 2011 10:56:27 AM Subject: Re: Box occasionally pegs one cpu at 100% This sounds like it could be garbage collection related, especially with a heap that large. Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM. Have you looked at something like jstat -gcutil or similar to monitor the garbage collection? On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote: > I have a fairly classic master/slave set up. > > Response times on the slave are generally good with blips periodically, > apparently when replication is happening. > > Occasionally however the process will have one incredibly slow query and > will peg the CPU at 100%. > > The weird thing is that it will remain that way even if we stop querying > it and stop replication and then wait for over 20 minutes. The only way > to fix the problem at that point is to restart tomcat. > > Looking at slow queries around the time of the incident they don't look > particularly bad - they're predominantly filter queries running under > dismax and there doesn't seem to be anything unusual about them. > > The index file is about 266G and has 30G of disk free. The machine has > 50G of RAM and is running with -Xmx35G. > > Looking at the processes running it appears to be the main Java thread > that's CPU bound, not the child threads. > > Stracing the process gives a lot of brk instructions (presumably some > sort of wait loop) with occasional blips of: > > > mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0 > futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, > {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, > {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, > 325, {1294683789, 614186000}, ) = 0 > futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0 > mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0 > mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0 > futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, > {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1 > mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0 > futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, > {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1 > futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0 > futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0 > mmap(0x7fc2e023, 121962496, PROT_NONE, > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = > 0x7fc2e023 > mmap(0x7fbca58e, 237568, PROT_NONE, > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = > 0x7fbca58e > > Any ideas about what's happening and if there's any way to mitigate it?
> If the box at least recovered then I could run another slave and load > balance between them working on the principle that the second box > would pick up the slack whilst the first box restabilised but, as it is, > that's not reliable. > > Thanks, > > Simon >
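To check the GC theory Brian raises, something like the following (assuming the Sun/Oracle JDK tools are on the path; the pid is whatever your Tomcat JVM's process id is) samples the collectors every 5 seconds:

jstat -gcutil <tomcat-pid> 5000

If the FGC and FGCT columns are climbing while the CPU is pegged, it's full collections on that 35G heap; if they're flat, GC is off the hook and something else owns that thread.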
Re: How to let crawlers in, but prevent their damage?
Hmmm, so if someone says they have SEO skills on their resume, they COULD be talking about optimizing the SEARCH engine at some site, not just a web site to be crawled by search engines? - Original Message From: Ken Krugler To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 9:07:43 AM Subject: Re: How to let crawlers in, but prevent their damage? On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote: > Hi Ken, thanks Ken. :) > > The problem with this approach is that it exposes very limited content to > bots/web search engines. > > Take http://search-lucene.com/ for example. People enter all kinds of queries > in web search engines and end up on that site. People who visit the site > directly don't necessarily search for those same things. Plus, new terms are > entered to get to search-lucene.com every day, so keeping up with that would > mean constantly generating more and more of those static pages. Basically, the > tail is super long. To clarify - the issue of using actual user search traffic is one of SEO, not what content you expose. If, for example, people commonly do a search for "java " then that's a hint that the URL to the static content, and the page title, should have the language as part of it. So you shouldn't be generating static pages based on search traffic. Though you might want to decide what content to "favor" (see below) based on popularity. > On top of that, new content is constantly being generated, > so one would have to also constantly both add and update those static pages. Yes, but that's why you need to automate that content generation, and do it on a regular (e.g. weekly) basis. The big challenges we ran into were: 1. Dealing with badly behaved bots that would hammer the site. We wound up putting this content on a separate system, so it wouldn't impact users on the main system. And generating a regular report by user agent & IP address, so that we could block by robots.txt and IP when necessary. 2. Figuring out how to structure the static content so that it didn't look like spam to Google/Yahoo/Bing. You don't want to have too many links per page, or too much depth, but that constrains how many pages you can reasonably expose. We had project scores based on code, activity, usage - so we used that to rank the content and focus on exposing early (low depth) the "good stuff". You could do the same based on popularity, from search logs. Anyway, there's a lot to this topic, but it doesn't feel very Solr specific. So apologies for reducing the signal-to-noise ratio with talk about SEO :) -- Ken > I have a feeling there is not a good solution for this because on one hand > people don't like the negative bot side effect, on the other hand people want as > much of their sites indexed by the big guys. The only half-solution that comes > to mind involves looking at who's actually crawling you and who's bringing you > visitors, then blocking those with a bad ratio of those two - bots that crawl a > lot but don't bring a lot of value. > > Any other ideas? > > Thanks, > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message >> From: Ken Krugler >> To: solr-user@lucene.apache.org >> Sent: Mon, January 10, 2011 9:43:49 AM >> Subject: Re: How to let crawlers in, but prevent their damage? >> >> Hi Otis, >> >> From what I learned at Krugle, the approach that worked for us was: >> >> 1. Block all bots on the search page. >> >> 2.
Expose the target content via statically linked pages that are separately >> generated from the same backing store, and optimized for target search terms >> (extracted from your own search logs). >> >> -- Ken >> >> On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote: >> >>> Hi, >>> >>> How do people with public search services deal with bots/crawlers? >>> And I don't mean to ask how one bans them (robots.txt) or slow them down >> (Delay >>> stuff in robots.txt) or prevent them from digging too deep in search >> results... >>> >>> What I mean is that when you have publicly exposed search that bots crawl, >> they >>> issue all kinds of crazy "queries" that result in errors, that add noise to >> Solr >>> caches, increase Solr cache evictions, etc. etc. >>> >>> Are there some known recipes for dealing with them, minimizing their >> negative >>> side-effects, while still letting them crawl you? >>> >>> Thanks, >>> Otis >>> >>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >>> Lucene ecosystem search :: http://search-lucene.com/ >>> >> >> -- >> Ken Krugler >> +1 530-210-6378 >> http://bixolabs.com >> e l a s t i c w e b m i n i n g >> >> >> >> >> >> -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
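For the blocking end of the "badly behaved bots" report Ken describes, the polite half is plain robots.txt; a sketch (paths and bot names are illustrative):

User-agent: SomeAbusiveBot
Disallow: /

User-agent: *
Disallow: /search
Crawl-delay: 10

robots.txt only restrains bots that choose to read it; the ones that ignore it are the ones that end up in the IP-level blocks from the user-agent/IP report.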
Re: PHP PECL solr API library
Yeah, it doesn't look like an easy, CRUD-based interface. - Original Message From: Lukas Kahwe Smith To: solr-user@lucene.apache.org Sent: Sun, January 9, 2011 11:33:16 PM Subject: Re: PHP PECL solr API library On 10.01.2011, at 08:16, Dennis Gearon wrote: > Anyone have any experience using this library? > > http://us3.php.net/solr > Yeah, it works quite well. However, IMHO the API is a maze. Also it's lacking critical stuff like escaping, and nice-to-have stuff like Lucene query parsing/rewriting. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: How to let crawlers in, but prevent their damage?
- Original Message From: lee carroll To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 6:48:12 AM Subject: Re: How to let crawlers in, but prevent their damage? Sorry, not an answer, but a +1 vote for finding out best practice for this. Related to it are DOS attacks. We have rewrite rules in between the proxy server and Solr which attempt to filter out undesirable stuff, but would it be better to have a query app doing this? Any standard rewrite rules which drop invalid or potentially malicious queries would be very nice :-) What exactly are malicious queries? (besides scraping) What's the problem with invalid queries? Unless someone is doing a custom crawl/scraping of your site, how are they going to issue queries that aren't already on the site as URLs? On 10 January 2011 13:41, Otis Gospodnetic wrote: > Hi, > > How do people with public search services deal with bots/crawlers? > And I don't mean to ask how one bans them (robots.txt) or slow them down > (Delay > stuff in robots.txt) or prevent them from digging too deep in search > results... > > What I mean is that when you have publicly exposed search that bots crawl, > they > issue all kinds of crazy "queries" that result in errors, that add noise to > Solr > caches, increase Solr cache evictions, etc. etc. > > Are there some known recipes for dealing with them, minimizing their > negative > side-effects, while still letting them crawl you? > > Thanks, > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > >
Re: How to let crawlers in, but prevent their damage?
I don't know about stopping the problems with the issues that you've raised. But I do know that web sites that aren't idempotent with GET requests are in a hurt locker. That seems to be WAY too many of them. This means: don't do anything with GET that changes the contents of your web site. Regarding a more direct answer to your question, you'd probably have to have some sort of filtering applied. And anyway, crawlers only issue 'queries' based on the URLs found in the site, right? So are you going to have weird URLs embedded in your site? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Otis Gospodnetic To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 5:41:17 AM Subject: How to let crawlers in, but prevent their damage? Hi, How do people with public search services deal with bots/crawlers? And I don't mean to ask how one bans them (robots.txt) or slows them down (Delay stuff in robots.txt) or prevents them from digging too deep in search results... What I mean is that when you have publicly exposed search that bots crawl, they issue all kinds of crazy "queries" that result in errors, that add noise to Solr caches, increase Solr cache evictions, etc. etc. Are there some known recipes for dealing with them, minimizing their negative side-effects, while still letting them crawl you? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Improving Solr performance
These are definitely server-grade machines. There aren't any desktops I know of (that aren't made for HD video editing/rendering) that ever need that kind of memory. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Shawn Heisey To: solr-user@lucene.apache.org Sent: Sun, January 9, 2011 4:34:08 PM Subject: Re: Improving Solr performance On 1/7/2011 2:57 AM, supersoft wrote: > have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs > shard2 has 920414 docs shard3 has 602772 docs shard4 has 2083492 docs shard5 > has 11915639 docs Indexes total size: 100GB > > The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I > run the server using Jetty (from Solr example download) with: java -Xmx3024M > -Dsolr.solr.home=multicore -jar start.jar > > The response time for a query is around 2-3 seconds. Nevertheless, if I > execute several queries at the same time the performance goes down > immediately: 1 simultaneous query: 2516ms 2 simultaneous queries: 4250,4469 > ms 3 simultaneous queries: 5781, 6219, 6219 ms 4 simultaneous queries: 6484, > 7203, 7719, 7781 ms... I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM. Alternatively, you could use SSD to store the indexes instead of spinning hard drives, or put each shard on its own physical machine with RAM appropriately sized for the index. For shard5 on its own machine, at 64GB index size, you might be able to get away with 32GB, but ideally you'd want 48-64GB. Can you do anything to reduce the index size? Perhaps you are storing fields that you don't need to be returned in the search results. Ideally, you should only include enough information to fully populate a search results grid, and retrieve detail information for an individual document from the original data source instead of Solr. Thanks, Shawn
Re: (FQ) Filter Query Caching Differences with OR and AND?
And the sky is blue and the night is black - Original Message From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Sent: Wed, January 5, 2011 2:18:20 PM Subject: Re: (FQ) Filter Query Caching Differences with OR and AND? Um, good or bad for what? It depends. But it's how Solr works either way. On 1/5/2011 5:10 PM, Dennis Gearon wrote: > Is that good or bad? > > Dennis Gearon > > - Original Message > From: Jonathan Rochkind > To: "solr-user@lucene.apache.org" > Cc: Em > Sent: Wed, January 5, 2011 1:53:23 PM > Subject: Re: (FQ) Filter Query Caching Differences with OR and AND? > > Each 'fq' clause is its own cache key. > > 1. fq=foo:bar OR foo:baz > => one entry in filter cache > > 2. fq=foo:bar&fq=foo:baz > => two entries in filter cache, will not use cached entry from #1 > > 3. fq=foo:bar > => One entry, will use cached entry from #2 > > 4. fq=foo:baz > => One entry, will use cached entry from #2. > > So if you do queries in succession using each of those four fq's in order, you > will wind up with 3 entries in the cache. > > Note that "fq=foo:bar OR foo:baz" is not semantically identical to > "fq=foo:bar&fq=foo:baz". Rather, the latter is semantically identical to "fq=foo:bar > AND foo:baz". But "fq=foo:bar&fq=foo:baz" will be two cache entries, and "fq=foo:bar > AND foo:baz" will be one cache entry, and the two won't share any cache entries. > > > On 1/5/2011 3:17 PM, Em wrote: >> Hi, >> >> while reading through some information on the list and in the wiki, I found >> out that something is missing: >> >> When I specify filter queries like this >> >> fq=foo:bar OR foo:baz >> or >> fq=foo:bar&fq=foo:baz >> or >> fq=foo:bar >> or >> fq=foo:baz >> >> How many filter query entries will be cached? >> Two, since there are two filters (foo:bar, foo:baz) or 3, since there are >> three different combinations (foo:bar OR foo:baz, foo:bar, foo:baz)? >> >> Thank you! >
Re: (FQ) Filter Query Caching Differences with OR and AND?
Is that good or bad? Dennis Gearon - Original Message From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Cc: Em Sent: Wed, January 5, 2011 1:53:23 PM Subject: Re: (FQ) Filter Query Caching Differences with OR and AND? Each 'fq' clause is its own cache key. 1. fq=foo:bar OR foo:baz => one entry in filter cache 2. fq=foo:bar&fq=foo:baz => two entries in filter cache, will not use cached entry from #1 3. fq=foo:bar => One entry, will use cached entry from #2 4. fq=foo:baz => One entry, will use cached entry from #2. So if you do queries in succession using each of those four fq's in order, you will wind up with 3 entries in the cache. Note that "fq=foo:bar OR foo:baz" is not semantically identical to "fq=foo:bar&fq=foo:baz". Rather, the latter is semantically identical to "fq=foo:bar AND foo:baz". But "fq=foo:bar&fq=foo:baz" will be two cache entries, and "fq=foo:bar AND foo:baz" will be one cache entry, and the two won't share any cache entries. On 1/5/2011 3:17 PM, Em wrote: > Hi, > > while reading through some information on the list and in the wiki, I found > out that something is missing: > > When I specify filter queries like this > > fq=foo:bar OR foo:baz > or > fq=foo:bar&fq=foo:baz > or > fq=foo:bar > or > fq=foo:baz > > How many filter query entries will be cached? > Two, since there are two filters (foo:bar, foo:baz) or 3, since there are > three different combinations (foo:bar OR foo:baz, foo:bar, foo:baz)? > > Thank you!
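An easy way to watch this behavior on a running instance is the filterCache block on /solr/admin/stats.jsp: issue the four variants above in order and watch lookups, hits, and inserts — inserts should stop growing at 3 (the OR clause plus the two individual clauses), with #3 and #4 registering as hits.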
Re: uuid, COMB uuid, distributed farms
Right, Lance, I meant in the field definition. I appreciate your help and direction. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Lance Norskog To: solr-user@lucene.apache.org Sent: Tue, January 4, 2011 7:15:07 PM Subject: Re: uuid, COMB uuid, distributed farms 'NOT NULL' in the schema is 'required=true' in a <field> element. 'Search for NOT NULL' is a little odd: you search for a range and then negate the search, meaning for documents with nothing in that field. This standard query does it: -field:[* TO *] On Tue, Jan 4, 2011 at 2:49 PM, Dennis Gearon wrote: > Thanks Lance. > > I will be generating the COMB style of UUID external to Solr. > Prevents a lot of index paging during INSERTS on DBs, maybe Solr too. > > So I would not use 'NEW' in the following, right? > Just leave default out? > Some sort of NOT NULL available in a Solr schema? > > > PHP code to make the COMB style of UUID, > easily adapted to other languages, some solutions already exist: > > > //requires php5_uuid module in PHP > function make_comb_uuid(){ > uuid_create(&$v4); > uuid_make($v4, UUID_MAKE_V4); > uuid_export($v4, UUID_FMT_STR, &$v4String); > $var=gettimeofday(); > return > substr($v4String,0,24).substr(dechex($var['sec'].$var['usec']),0,12); > > } > > > > Dennis Gearon > > > > > - Original Message > From: Lance Norskog > To: solr-user@lucene.apache.org > Sent: Tue, January 4, 2011 2:15:32 PM > Subject: Re: uuid, COMB uuid, distributed farms > > http://wiki.apache.org/solr/UniqueKey > > On Mon, Jan 3, 2011 at 6:55 PM, pankaj bhatt wrote: >> HI Dennis, >> I have used UUID in context of an application where an installation id >> (UUID) is generated by the code. It caters to around 10K users. >> I have not used it in context of SOLR. >> >> / Pankaj Bhatt. >> >> On Mon, Jan 3, 2011 at 11:05 PM, Dennis Gearon wrote: >> >>> Thank you Pankaj. >>> >>> How large was your installation of Solr? I'm hoping to get mine to be >>> multinational and am making plans for that as I go. So having unique ids, >>> UUIDs, >>> that cover a huge addressable space is a requirement. >>> >>> If yours was comparable, how were your replication issues, merging issues, >>> anything else related to getting large datasets searchable and unique? >>> >>> Dennis Gearon >>> >>> >>> Signature Warning >>> >>> It is always a good idea to learn from your own mistakes. It is usually a >>> better >>> idea to learn from others’ mistakes, so you do not have to make them >>> yourself. >>> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' >>> >>> >>> EARTH has a Right To Life, >>> otherwise we all die. >>> >>> >>> >>> - Original Message >>> From: pankaj bhatt >>> To: solr-user@lucene.apache.org; gear...@sbcglobal.ne >>> Sent: Mon, January 3, 2011 8:55:21 AM >>> Subject: Re: uuid, COMB uuid, distributed farms >>> >>> Hi Dennis, >>> >>> I have used UUID's in my project to identify a basic installation of >>> the client. >>> Can I be of any help. >>> >>> / Pankaj Bhatt. >>> >>> On Mon, Jan 3, 2011 at 3:28 AM, Dennis Gearon >>> wrote: >>> >>> > Planning ahead here. >>> > >>> > Anyone have experience with UUIDs, COMB UUIDs (sequential) in large, >>> > internationally distributed Solr/Database projects?
>>> > >>> > Dennis Gearon >>> > >>> > >>> > Signature Warning >>> > >>> > It is always a good idea to learn from your own mistakes. It is usually a >>> > better >>> > idea to learn from others’ mistakes, so you do not have to make them >>> > yourself. >>> > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' >>> > >>> > >>> > EARTH has a Right To Life, >>> > otherwise we all die. >>> > >>> > >>> >>> >> > > > > -- > Lance Norskog > goks...@gmail.com > > -- Lance Norskog goks...@gmail.com
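Put concretely, the schema side of that would look something like this (a sketch following the stock example schema; drop default="NEW" since the COMB value arrives from outside, and required="true" is the closest thing to NOT NULL):

<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="id" type="uuid" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

With no default and required="true", a document that arrives without an id is rejected at add time.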
Re: uuid, COMB uuid, distributed farms
Thanks Lance. I will be generating the COMB style of UUID external to Solr. Prevents a lot of index paging during INSERTS on DBs, maybe Solr too. So I would not use 'NEW' in the following, right? Just leave default out? Some sort of NOT NULL available in a Solr schema? PHP code to make the COMB style of UUID, easily adapted to other languages, some solutions already exist:

//requires the php5_uuid (OSSP uuid) module in PHP
function make_comb_uuid(){
  uuid_create(&$v4);                          // allocate a uuid resource
  uuid_make($v4, UUID_MAKE_V4);               // fill it with a random v4 UUID
  uuid_export($v4, UUID_FMT_STR, &$v4String); // render it as a string
  $var=gettimeofday();
  // Keep the first 24 chars of the v4 string and replace the tail with a
  // hex-encoded timestamp, so ids generated later sort later (the COMB
  // trick). The dechex() of the concatenated sec/usec string assumes
  // 64-bit PHP integers.
  return substr($v4String,0,24).substr(dechex($var['sec'].$var['usec']),0,12);
}

Dennis Gearon - Original Message From: Lance Norskog To: solr-user@lucene.apache.org Sent: Tue, January 4, 2011 2:15:32 PM Subject: Re: uuid, COMB uuid, distributed farms http://wiki.apache.org/solr/UniqueKey On Mon, Jan 3, 2011 at 6:55 PM, pankaj bhatt wrote: > HI Dennis, > I have used UUID in context of an application where an installation id > (UUID) is generated by the code. It caters to around 10K users. > I have not used it in context of SOLR. > > / Pankaj Bhatt. > > On Mon, Jan 3, 2011 at 11:05 PM, Dennis Gearon wrote: > >> Thank you Pankaj. >> >> How large was your installation of Solr? I'm hoping to get mine to be >> multinational and am making plans for that as I go. So having unique ids, >> UUIDs, >> that cover a huge addressable space is a requirement. >> >> If yours was comparable, how were your replication issues, merging issues, >> anything else related to getting large datasets searchable and unique? >> >> Dennis Gearon >> >> >> Signature Warning >> >> It is always a good idea to learn from your own mistakes. It is usually a >> better >> idea to learn from others’ mistakes, so you do not have to make them >> yourself. >> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' >> >> >> EARTH has a Right To Life, >> otherwise we all die. >> >> >> >> - Original Message >> From: pankaj bhatt >> To: solr-user@lucene.apache.org; gear...@sbcglobal.ne >> Sent: Mon, January 3, 2011 8:55:21 AM >> Subject: Re: uuid, COMB uuid, distributed farms >> >> Hi Dennis, >> >> I have used UUID's in my project to identify a basic installation of >> the client. >> Can I be of any help. >> >> / Pankaj Bhatt. >> >> On Mon, Jan 3, 2011 at 3:28 AM, Dennis Gearon >> wrote: >> >> > Planning ahead here. >> > >> > Anyone have experience with UUIDs, COMB UUIDs (sequential) in large, >> > internationally distributed Solr/Database projects? >> > >> > Dennis Gearon >> > >> > >> > Signature Warning >> > >> > It is always a good idea to learn from your own mistakes. It is usually a >> > better >> > idea to learn from others’ mistakes, so you do not have to make them >> > yourself. >> > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' >> > >> > >> > EARTH has a Right To Life, >> > otherwise we all die. >> > >> > >> >> > -- Lance Norskog goks...@gmail.com
Re: Sub query using SOLR?
Essentially, a subquery is an AND expression where you ask the database to find the identifier or set of identifiers to then use in the query outside the subquery. The data that you put into a Solr index is flattened, denormalized. So take the subquery field values and put them in an AND part of the query to Solr. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Estrada Groups To: "solr-user@lucene.apache.org" Sent: Tue, January 4, 2011 10:33:29 AM Subject: Re: Sub query using SOLR? I am +1 on the interest on how to do this! Adam On Jan 4, 2011, at 1:26 PM, bbarani wrote: > > Hi, > > I am trying to use a subquery in SOLR, is there a way to implement this using > SOLR query syntax? > > Something like > > Related_id: IN query(field=ud, q="type:IT AND manager_12:dave") > > The thing I really want is to use the output of one query as the input of > another query. > > Not sure if it is possible to use the query() function (function query) for > my case.. > > Just want to know if there is a better approach... > > Thanks, > Barani > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Sub-query-using-SOLR-tp2193251p2193251.html > Sent from the Solr - User mailing list archive at Nabble.com.
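As a sketch of that flattening: the SQL-style subquery Related_id IN (SELECT id FROM staff WHERE type='IT' AND manager='dave') — table and column names invented here — becomes, once the type and manager values are denormalized onto each document at index time, a single conjunctive Solr query using the field names from the original post:

q=type:IT AND manager_12:dave AND (whatever the outer query was asking for)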
Re: uuid, COMB uuid, distributed farms
Thank you Pankaj. How large was your installation of Solr? I'm hoping to get mine to be multinational and am making plans for that as I go. So having unique ids, UUIDs, that cover a huge addressable space is a requirement. If yours was comparable, how were your replication issues, merging issues, anything else related to getting large datasets searchable and unique? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: pankaj bhatt To: solr-user@lucene.apache.org; gear...@sbcglobal.ne Sent: Mon, January 3, 2011 8:55:21 AM Subject: Re: uuid, COMB uuid, distributed farms Hi Dennis, I have used UUID's in my project to identify a basic installation of the client. Can I be of any help. / Pankaj Bhatt. On Mon, Jan 3, 2011 at 3:28 AM, Dennis Gearon wrote: > Planning ahead here. > > Anyone have experience with UUIDs, COMB UUIDs (sequential) in large, > internationally distributed Solr/Database projects? > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a > better > idea to learn from others’ mistakes, so you do not have to make them > yourself. > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. > >
uuid, COMB uuid, distributed farms
Planning ahead here. Anyone have experience with UUIDs, COMB UUIDs (sequential) in large, internationally distributed Solr/Database projects? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: dynamic fields revisited
When my Solr guru gets back, we'll redo the schema and see what happens, thanks! Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Lance Norskog To: solr-user@lucene.apache.org Sent: Thu, December 30, 2010 4:26:58 PM Subject: Re: dynamic fields revisited solr/admin/analysis.jsp uses the Luke handler. You can browse facets and fields. On Wed, Dec 29, 2010 at 7:46 PM, Ahmet Arslan wrote: >> If I understand you correctly, for an INT dynamic field >> called *_int2 >> filled with a field called my_number_int2 during data >> import, >> in a query I will search the index on the field >> called: >> "my_number_int2" >> >> correct? >> > > Exactly. > > Using http://wiki.apache.org/solr/LukeRequestHandler you can retrieve the real > field names under *_int2, if that helps. > > > > -- Lance Norskog goks...@gmail.com
Re: dynamic fields revisited
- Original Message From: Lance Norskog To: solr-user@lucene.apache.org Sent: Wed, December 29, 2010 6:11:32 PM Subject: Re: dynamic fields revisited >>> B/ Is the search done on the dynamic field name in the schema, or on the name >> that was matched? > The dynamic wildcard field name convention is only implemented by the > code that checks the schema. > It is not in the query syntax. Only the real field names are in the > query syntax or returned facets. If I understand you correctly, for an INT dynamic field called *_int2, filled with a field called my_number_int2 during data import, in a query I will search the index on the field called "my_number_int2", correct?
dynamic fields revisited
Well, getting close to the time when the 'rubber meets the road'. A couple of questions about dynamic fields. A/ How much room in the index do 'non-used' dynamic fields add per record, any? B/ Is the search done on the dynamic field name in the schema, or on the name that was matched? C/ Anyone done something like: //schema file// (representative, not actual) *_int1 *_int2 *_int3 *_int4 *_datetime1 *_datetime2 . . Then have fields in the imported data (especially using a DIH importing from a VIEW) that have custom names like: //import source//(representative, not actual) custom_labelA_int1 custom_labelB_int2 custom_labelC_datetime1 custom_labelD_datetime2 Is this how dynamic fields are used? I was thinking of having approximately 1-20 dynamic fields per datatype of interest. D/ If I wanted all text-based dynamic fields added to some common field in the index (sorry, bad terminology), how is that done? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
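For C/, the declarations sketched above would look something like this in schema.xml (attribute choices are illustrative):

<dynamicField name="*_int1" type="int" indexed="true" stored="true"/>
<dynamicField name="*_int2" type="int" indexed="true" stored="true"/>
<dynamicField name="*_datetime1" type="date" indexed="true" stored="true"/>

A field named custom_labelA_int1 in the import then simply matches the *_int1 pattern. For D/, copyField accepts a wildcard source, so funneling every dynamic text field into a common searchable field is one line — assuming the text-typed dynamic fields share a *_txt suffix and a catch-all 'text' field exists, as in the example schema:

<copyField source="*_txt" dest="text"/>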
Re: Item catagorization problem.
Doesn't indexing/analyzing do this to some degree anyway? Not sure of the algorithm, but something like: how often, how near the top, how many different forms, subject or object of a sentence. That has to have some relevance to what category something is in. The simplest extension to that would be something like a 'sub vocabulary' cross listing. If such and such words were high relevance, then the subject is about this or that. The smartest categorizer is your users, though. So the best way to make that list is to keep track of how close to the top of the search results a user responded, what the words were, and how many search attempts it took. That's what Netflix does. Their goal is to have users get something in the top three off the first search attempt. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Erick Erickson To: solr-user@lucene.apache.org Sent: Thu, December 23, 2010 10:00:05 AM Subject: Re: Item catagorization problem. What you're asking for appears to me to be "auto-categorization", and there's nothing built into Solr to do this. Somehow you need to analyze the documents at index time and add the proper categories, but I have no clue how. This is especially hard with short fields since most auto-categorization algorithms try to do some statistical analysis of the document to figure this out. Best Erick On Thu, Dec 23, 2010 at 8:12 AM, Hasnain wrote: > > Hi all, > > I am using solr in my web application for search purposes. However, I > am having a problem with the default behaviour of the solr search. > > From my understanding, if I query for a keyword, let's say "Laptop", > preference is given to result rows having more occurrences of the search > keyword "Laptop" in the field "name". This, however, is producing > undesirable scenarios, for example: > > 1. I index an item A with "name" value "Sony Laptop". > 2. I index another item B with "name" value: "Laptop bags for laptops". > 3. I search for the keyword "Laptop" > > According to the default behaviour, precedence would be given to item B > since the keyword appears more times in the "name" field for that item. > > In my schema, I have another field by the name of "Category" and, for > example's sake, let's assume that my application supports only two > categories: computers and accessories. Now, what I require is a mechanism > to > assign correct categories to the items during item indexing so that this > field can be used to better filter the search results, item A would belong > to the "Computers" category and item B would belong to the "Accessories" category. > So > then, searching for "Laptop" would only look for items in the "Computers" > category and return item A only. > > I would like to point out here that setting the category field manually is > not an option since the data might be in the vicinity of thousands of > records. I am not asking for an in-depth algorithm. Just a high level > design > would be sufficient to set me in the right direction. > > thanks. > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Item-catagorization-problem-tp2136415p2136415.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Recap on derived objects in Solr Index, 'schema in a can'
I think I'm just going to have to have my partner and me play with both cores and dynamic fields. If multiple cores are queried, and the schemas match up in order and position for the base fields, the 'extra' fields in the different cores just show up in the result set with their field names? The query against different cores, with 'base attributes' and 'extended attributes', has to be tailored for each core, right? I.e., not querying for fields that don't exist? (That could be handled by making the query a server-side language object with inheritance for the extended fields.) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Lance Norskog To: solr-user@lucene.apache.org Sent: Wed, December 22, 2010 1:45:04 PM Subject: Re: Recap on derived objects in Solr Index, 'schema in a can' A dynamic field just means that the schema allows any field with a name matching the wildcard. That's all. There is no support for referring to all of the existing fields in the wildcard. That is, there is no support for "*_en:word" as a field search. Nor is there any kind of grouping for facets. The feature for addressing a particular field in some of the parameters does not support wildcards. If you add wildcard fields, you have to remember what they are. On Wed, Dec 22, 2010 at 11:04 AM, Dennis Gearon wrote: > I'm open to cores, if it's the faster (indexing/querying/keeping mentally > straight) way to do things. > > But from what you say below, the eventual goal of the site would mean either 100 > extra 'generic' fields, or 1,000-100,000's of cores. > Probably cores is easier to administer for security and does more accurate > querying? > > What is the relationship between dynamic fields and the schema? > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a better > idea to learn from others’ mistakes, so you do not have to make them yourself. > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. > > > > - Original Message > From: Erick Erickson > To: solr-user@lucene.apache.org > Sent: Wed, December 22, 2010 10:44:27 AM > Subject: Re: Recap on derived objects in Solr Index, 'schema in a can' > > No, one cannot ignore the schema. If you try to add a field not in the > schema you get > an error. One could, however, use any arbitrary subset > of the fields defined in the schema for any particular #document# in the > index. Say > your schema had fields f1, f2, f3...f10. You could have fields f1-f5 in one > doc, and > fields f6-f10 in another and f1, f4, f9 in another, and so on. > > The only field(s) that #must# be in a document are the required="true" > fields. > > There's no real penalty for omitting fields from particular documents. This > allows > you to store "special" documents that aren't part of normal searches. > > You could, for instance, use a document to store meta-information about your > index that had whatever meaning you wanted in a field(s) that *no* other > document > had. Your app could then read that "special" document and make use of that > info. > Searches on "normal" documents wouldn't return that doc, etc.
> > You could effectively have N indexes contained in one index where a document > in each logical sub-index had fields disjoint from the other logical > sub-indexes. > Why you'd do something like that rather than use cores is a very good > question, > but you #could# do it that way... > > All this is much different from a database where there are penalties for > defining > a large number of unused fields. > > Whether doing this is wise or not given the particular problem you're trying > to > solve is another discussion .. > > Best > Erick > > On Mon, Dec 20, 2010 at 11:03 PM, Dennis Gearon wrote: > >> Based on more searches and manual consolidation, I've put together some of >> the ideas for this already suggested in a summary below. The last item in >> the >> summary >> seems to be interesting, low technical cost way of doing it. >> >> Basically, it treats the index like a 'BigTable', a la "No SQL". >> >> Erick E
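For the "aggregate them in during the query" part of the question above, the usual mechanism is distributed search over the cores, e.g. (host and core names illustrative):

http://localhost:8983/solr/core0/select?q=base_field:foo&shards=localhost:8983/solr/core0,localhost:8983/solr/core1

with the standing caveat from the distributed-search docs that the uniqueKey field must be unique across all the shards; fields that exist in only one core simply come back on the documents that have them.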
Re: Recap on derived objects in Solr Index, 'schema in a can'
I'm open to cores, if it's the faster (indexing/querying/keeping mentally straight) way to do things. But from what you say below, the eventual goal of the site would mean either 100 extra 'generic' fields, or 1,000-100,000's of cores. Probably cores is easier to administer for security and does more accurate querying? What is the relationship between dynamic fields and the schema? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Erick Erickson To: solr-user@lucene.apache.org Sent: Wed, December 22, 2010 10:44:27 AM Subject: Re: Recap on derived objects in Solr Index, 'schema in a can' No, one cannot ignore the schema. If you try to add a field not in the schema you get an error. One could, however, use any arbitrary subset of the fields defined in the schema for any particular #document# in the index. Say your schema had fields f1, f2, f3...f10. You could have fields f1-f5 in one doc, and fields f6-f10 in another and f1, f4, f9 in another, and so on. The only field(s) that #must# be in a document are the required="true" fields. There's no real penalty for omitting fields from particular documents. This allows you to store "special" documents that aren't part of normal searches. You could, for instance, use a document to store meta-information about your index that had whatever meaning you wanted in a field(s) that *no* other document had. Your app could then read that "special" document and make use of that info. Searches on "normal" documents wouldn't return that doc, etc. You could effectively have N indexes contained in one index where a document in each logical sub-index had fields disjoint from the other logical sub-indexes. Why you'd do something like that rather than use cores is a very good question, but you #could# do it that way... All this is much different from a database where there are penalties for defining a large number of unused fields. Whether doing this is wise or not given the particular problem you're trying to solve is another discussion .. Best Erick On Mon, Dec 20, 2010 at 11:03 PM, Dennis Gearon wrote: > Based on more searches and manual consolidation, I've put together some of > the ideas for this already suggested in a summary below. The last item in the > summary > seems to be an interesting, low-technical-cost way of doing it. > > Basically, it treats the index like a 'BigTable', a la "No SQL". > > Erick Erickson pointed out: > "...but there's absolutely no requirement > that all documents in SOLR have the same fields..." > > I guess I don't have the right understanding of what goes into a Document > in Solr. Is it just a set of fields, each with its own independent field type > declaration/id, its name, and its content? > > So even though there's a schema for an index, one could ignore it and > just throw any other named fields and types and content at document addition > time? > > So if I wanted to search on a base set, all documents having it, I could then > additionally filter based on the (might be wrong use of this) dynamic fields? > > > > > > > Original Thread that I started: > > http://lucene.472066.n3.nabble.com/A-schema-inside-a-Solr-Schema-Schema-in-a-can-tt2103260.html > > > - > - > > Repeat of the problem, (not actual ratios, numbers, i.e.
could be WORSE!): > > 1/ Base object of some kind, x number of fields > 2/ Derived objects representing Divisions in the company, different customer bases, > etc., > each having 2 additional, unique fields. > 3/ Assume 1000 such derived object types > 4/ A 'flattened' index would have the x base object fields, > and 2000 additional fields > > > > Solutions Posited > --- > > A/ First thought, multi-value columns as key pairs. > 1/ Difficult to access individual items of more than one 'word' length > for querying in multivalued fields. > 2/ All sorts of statistical stuff probably wouldn't apply? > 3/ (James Dyer said:) There
Re: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria
Have you investigated 'field collapsing'? I believe that it is at least the 'DISTINCT' part. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: dan sutton To: solr-user Sent: Wed, December 22, 2010 1:29:23 AM Subject: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria Hi, Is there a way with faceting or field collapsing to do the SQL equivalent of SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria i.e. I'm only interested in the total count, not the individual records and counts. Cheers, Dan
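One approximation with plain faceting (no field collapsing needed) is to facet on the field, return all buckets, and count them client-side; facet.mincount=1 plays the role of length(field) > 0, and the q/fq carry the other_criteria:

select?q=other_criteria&rows=0&facet=true&facet.field=field&facet.limit=-1&facet.mincount=1

The number of entries returned under facet_fields is the distinct count. The caveat is that the count comes back as the size of the list rather than a single number, so very high-cardinality fields make for a big response.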
Re: Consequences for using multivalued on all fields
Thank you for the input. You might have seen my posts about doing a flexible schema for derived objects. Sounds like dynamic fields might be the ticket. We'll be ready to test the idea in about a month, maybe 3 weeks. I'll post a comment about it when it gets there. I don't know if I would gain anything, but I think that ALL booleans that were NOT in the base object but were in the derived objects could be put into one field as textually positioned key:value pairs, at least for search purposes. Since the derived object would have its own, additional methods, one of those methods could be to 'unserialize' the 'boolean column'. In fact, that could be a base object function - empty boolean column values just end up not populating any extra base object attributes. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: kenf_nc To: solr-user@lucene.apache.org Sent: Tue, December 21, 2010 6:07:51 AM Subject: Re: Consequences for using multivalued on all fields I have about 30 million documents and with the exception of the Unique ID, Type and a couple of date fields, every document is made of dynamic fields. Now, I only have maybe 1 in 5 being multi-value, but search and facet performance doesn't look appreciably different from a fixed schema solution. I don't do some of the fancier things, highlighting, spell check, etc. And I use a lot more string or lowercase field types than I do Text (so not as many fully tokenized fields), that probably helps with performance. The only disadvantage I know of is dealing with field names at runtime. Depending on your architecture, you don't really know what your document looks like until you have it in a result set. For what I'm doing, that isn't a problem. -- View this message in context: http://lucene.472066.n3.nabble.com/Consequences-for-using-multivalued-on-all-fields-tp2125867p2126120.html Sent from the Solr - User mailing list archive at Nabble.com.
Recap on derived objects in Solr Index, 'schema in a can'
Based on more searches and manual consolidation, I've put together some of the ideas for this already suggested in a summary below. The last item in the summary seems to be an interesting, low-technical-cost way of doing it. Basically, it treats the index like a 'BigTable', a la "No SQL". Erick Erickson pointed out: "...but there's absolutely no requirement that all documents in SOLR have the same fields..." I guess I don't have the right understanding of what goes into a Document in Solr. Is it just a set of fields, each with its own independent field type declaration/id, its name, and its content? So even though there's a schema for an index, one could ignore it and just throw any other named fields and types and content at document addition time? So if I wanted to search on a base set, all documents having it, I could then additionally filter based on the (might be wrong use of this) dynamic fields? Original Thread that I started: http://lucene.472066.n3.nabble.com/A-schema-inside-a-Solr-Schema-Schema-in-a-can-tt2103260.html - Repeat of the problem, (not actual ratios, numbers, i.e. could be WORSE!): - 1/ Base object of some kind, x number of fields 2/ Derived objects representing Divisions in the company, different customer bases, etc., each having 2 additional, unique fields. 3/ Assume 1000 such derived object types 4/ A 'flattened' index would have the x base object fields, and 2000 additional fields Solutions Posited --- A/ First thought, multi-value columns as key pairs. 1/ Difficult to access individual items of more than one 'word' length for querying in multivalued fields. 2/ All sorts of statistical stuff probably wouldn't apply? 3/ (James Dyer said:) There's also one "gotcha" we've experienced when searching across multi-valued fields: SOLR will match across field occurrences. In the example below, if you were to search q=contrib_name:(james AND smith), you will get this record back. It matches one name from one contributor and another name from a different contributor. This is not what our users want. As a work-around, I am converting these to phrase queries with slop: "james smith"~50 ... Just use a slop # smaller than your positionIncrementGap and bigger than the # of terms entered. This will prevent the cross-field matches yet allow the words to occur in any order. The problem with this approach is that Lucene doesn't support wildcards in phrases B/ Dynamic fields were suggested, but I am not sure exactly how they work, and the person who suggested it was not sure it would work, either. C/ Different field naming conventions were suggested where field types were similar. I can't predict that. D/ Found this old thread, and it had other suggestions: 1/ Use multiple cores, one for each record type/schema, aggregate them in during the query. 2/ Use a fixed number of additional fields X 2. Each additional field is actually a pair of fields. The first of the pair gives the column name, the second gives the data. a) Although I like this, I wonder how many extra fields to use, b) it was pointed out that relevancy and other statistical criteria for queries might suffer. 3/ Index the different objects exactly as they are, i.e. as Erick Erickson said: "I'm not entirely sure this is germane, but there's absolutely no requirement that all documents in SOLR have the same fields. So it's possible for you to index the "wildly different content" in "wildly different fields" . Then searching for screen:LCD would be straightforward."...
Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
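And a rough sketch of option D/2 from the recap above, a fixed bank of generic name/value field pairs; the ext_name_N/ext_val_N names and the pair count are hypothetical:

    MAX_PAIRS = 10  # the open question from the post: how many pairs to reserve

    def to_paired_fields(extras):
        # {"region": "EMEA", "rush": "true"} -> generic paired fields, where
        # ext_name_N holds the derived column's name and ext_val_N its data.
        doc = {}
        for i, (name, value) in enumerate(sorted(extras.items())):
            if i >= MAX_PAIRS:
                raise ValueError("derived object has more extras than reserved pairs")
            doc["ext_name_%d" % i] = name
            doc["ext_val_%d" % i] = value
        return doc

    # to_paired_fields({"region": "EMEA"})
    #   -> {"ext_name_0": "region", "ext_val_0": "EMEA"}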
Re: A schema inside a Solr Schema (Schema in a can)
Here is a thread on this subject that I did not find earlier. Sometimes discussion, thought, and 'mulling' in the subconscious gets me better Google searches. http://lucene.472066.n3.nabble.com/multi-valued-associated-fields-td811883.html

Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.

- Original Message From: Dennis Gearon To: solr-user@lucene.apache.org Sent: Mon, December 20, 2010 10:19:53 AM Subject: Re: A schema inside a Solr Schema (Schema in a can)

Thanks James. So being accurate with fields within fields (multivalues) is probably not possible using the currently available analyzers.

- Original Message From: "Dyer, James" To: "solr-user@lucene.apache.org" Sent: Mon, December 20, 2010 7:16:43 AM Subject: RE: A schema inside a Solr Schema (Schema in a can)

Dennis, If you need to search a key/value pair, you'll have to put them both in the same field, somehow. One way is to re-index them using the key in the fieldname. For instance, suppose you have:

contributor: dyer, james
contributor: smith, sam
role: author
role: editor

...but you want to search only for authors, you could index these again with fieldnames like:

contrib_author: dyer, james
contrib_editor: smith, sam

Then you would query q=contributor:smith to search all contributors and q=contrib_editor:smith just to get editors. Another way to do it is to use some type of marker character sequence to define the "key" and index it like this:

contributor: dyer, james __author
contributor: smith, sam __editor

Then you can query like this: q=contributor:"smith __editor"~50 ... to search only for editors named Smith. We are not yet fully developed here on SOLR, but we currently use both of these approaches with a different search engine. One nice thing SOLR could add to this second approach that is not an option with our other system is the possibility of writing a custom analyzer that could maybe take some of the complexity out of the app. Not sure exactly how it'd work though...

James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

-Original Message- From: Dennis Gearon [mailto:gear...@sbcglobal.net] Sent: Friday, December 17, 2010 6:52 PM To: solr-user@lucene.apache.org Subject: RE: A schema inside a Solr Schema (Schema in a can)

So this is a currently usable plugin (except for the latest bug)? And is it possible to search within just one key:value pair in a multivalued field?

Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.

--- On Fri, 12/17/10, Ahmet Arslan wrote:
> From: Ahmet Arslan
> Subject: RE: A schema inside a Solr Schema (Schema in a can)
> To: solr-user@lucene.apache.org
> Date: Friday, December 17, 2010, 12:47 PM
> > The problem with this approach is that Lucene doesn't
> > support wildcards in phrases.
> With https://issues.apache.org/jira/browse/SOLR-1604 you can do that.
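A minimal Python sketch of James's two workarounds above; the helper names are mine, and the slop value just has to be smaller than the field's positionIncrementGap and bigger than the number of terms entered:

    def docs_keyed_fieldnames(contributors):
        # Workaround 1: fold the role into the field name (contrib_author,
        # contrib_editor), keeping a catch-all 'contributor' field for
        # role-agnostic searches like q=contributor:smith.
        doc = {}
        for name, role in contributors:
            doc.setdefault("contrib_%s" % role, []).append(name)
            doc.setdefault("contributor", []).append(name)
        return doc

    def marker_field_values(contributors):
        # Workaround 2: append a marker token to each value; query with
        # phrase slop, e.g. q=contributor:"smith __editor"~50
        return ["%s __%s" % (name, role) for name, role in contributors]

    people = [("dyer, james", "author"), ("smith, sam", "editor")]
    # docs_keyed_fieldnames(people) ->
    #   {"contrib_author": ["dyer, james"], "contrib_editor": ["smith, sam"],
    #    "contributor": ["dyer, james", "smith, sam"]}
    # marker_field_values(people) -> ["dyer, james __author", "smith, sam __editor"]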