[JOB] Lucid Imagination is hiring
Hi All,

If you've wanted a full time job working on Lucene or Solr, we have two positions open that just might be of interest. The job descriptions are below. Interested candidates should submit their resumes off list to care...@lucidimagination.com. You can learn more on our website: http://www.lucidimagination.com/about/careers.

Thanks,
Grant

- Open Source Software Engineer

DESCRIPTION
Lucid Imagination is looking for a software engineer to work on the open source Apache Solr and Lucene projects. As part of Lucid's open source team, you will help implement features and provide fixes for issues in the world's premier open source search server and library. You will also work closely with Lucid's research team and technical support team to enable both community and customer consumption of Solr and Lucene.

REQUIREMENTS
• Strong interest in working on high performance and large scale problems.
• Understanding of debugging and performance testing in highly concurrent systems.
• Core Java expertise.
• Experience writing unit tests and working with continuous integration tools.
• Willingness to participate in and contribute to a vibrant, fast-paced open source community.
• Strong interpersonal, written and verbal communication skills.
• Desire to learn and be a part of a startup.
• Degree in computer science or related field.
• Experience with Lucene, Solr, Hadoop and related NoSQL technologies is not required, but is considered a bonus.

EXPERIENCE
0-5 years programming experience in Java.

SALARY
Based on experience

LOCATION
Raleigh/Durham/Chapel Hill area (preferred)

TRAVEL
Minimal (occasional trips to California)

- Senior Consultant

DESCRIPTION
Lucid Imagination is currently looking to hire a Senior Consultant to be part of our Professional Services team.

REQUIREMENTS
• Experience working with Lucene and/or Solr required.
• Establish yourself as a credible, reliable, likable, genuine, and trustworthy advisor to your customers.
• Provide expert-level advisory services to a wide range of customers with varying degrees of technical knowledge.
• Clearly identify customer pain points, priorities, and success criteria at the onset of each engagement.
• Resolve complex search issues in and around the Lucene/Solr ecosystem.
• Document recommendations in the form of Best Practice Assessments.
• Identify opportunities to provide customers with additional value through follow-on products and/or services.
• Communicate high-value use cases and customer feedback to our Product Development and Engineering teams.
• Contribute to the open source community by donating needed bug fixes and improvements; answering message boards; documenting existing code; and blogging.
• Support Business Development through product demos and customer QA.
• Collaborate on internal Lucid projects.
• Develop training materials and deliver classroom training on occasion.

EXPERIENCE
• BS or higher in Engineering or Computer Science preferred.
• 3 or more years of IT Consulting and/or Professional Services experience required.
• Some Java development experience.
• Some experience with common scripting languages (Perl/Python/Ruby).
• Exposure to other related open source projects (Mahout, Hadoop, Tika, etc.) a plus.
• Experience with other commercial and open source search technologies a plus.
• Enterprise Search, eCommerce, and/or Business Intelligence experience a plus.
• Experience working in a startup a plus.

SALARY
Based on experience

LOCATION: San Francisco/Bay Area (preferred)
TRAVEL: 10-20%
Mixing norms and no norms in the same document
Hi. I'm indexing about 20,000 documents that could potentially have a few thousand fields with the same field name. I've read in the mailing list archives that there is no hard limit to the number of fields in a document, but that storing norms can be a problem because of the RAM overhead. I don't plan to boost documents or this particular set of like-named fields, so I think I can index them with ANALYZED_NO_NORMS. But will this cause a problem with scoring if I want to boost other fields in the same document? Thanks!
Re: Mixing norms and no norms in the same document
On Mon, Dec 5, 2011 at 5:44 PM, Rob Hasselbaum r...@hasselbaum.net wrote:
> I think I can index them with ANALYZED_NO_NORMS. But will this cause a problem with scoring if I want to boost other fields in the same document? [...]

As long as you are consistent, you should not have any problems. You can have different norm settings on different fields. However, if you index the same field both with and without norms, you will eventually end up with no norms on that field.

simon

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
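For concreteness, a minimal sketch of the mixed setup being discussed, using the Lucene 3.x field API (the field names "attr" and "title" and their values are made up for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MixedNormsExample {
    public static Document build() {
        Document doc = new Document();
        // The many like-named fields: analyzed but without norms, so they
        // add no per-document norm byte for this field.
        doc.add(new Field("attr", "some value", Field.Store.NO,
                Field.Index.ANALYZED_NO_NORMS));
        // A different field that keeps norms, so an index-time boost
        // still affects scoring.
        Field title = new Field("title", "boost me", Field.Store.YES,
                Field.Index.ANALYZED);
        title.setBoost(2.0f);
        doc.add(title);
        return doc;
    }
}
```

Per Simon's point, the key is consistency: every document should index "attr" with NO_NORMS, or the no-norms setting wins for the whole field.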
Use multiple lucene indices
Hi All,

We are planning to use Lucene in our project, but we are not entirely sure about some of the design decisions we have made. Below are the details; any comments/suggestions are more than welcome.

The requirements of the project are:

1. We have tens of thousands of files, ranging in size from 500M to a few terabytes, and the majority of the contents in these files will not be accessed frequently.
2. We are planning to keep the less accessed contents outside of our database and store them on the file system.
3. We also have code to get the binary positions of these contents in the files. Using these binary positions, we can quickly retrieve the contents and convert them into our domain objects.

We think Lucene provides a scalable solution for storing and indexing these binary positions. The idea is that each piece of content in the files will be a document, and each document will have at least an ID field to identify the content and a binary position field containing the start and stop positions of the content. Having done some performance testing, it seems to us that Lucene is well capable of doing this.

At the moment, we are planning to create one Lucene index per file, so if we have new files to be added to the system, we can simply generate a new index. The problem is to do with searching: this approach means that we need to create a new IndexSearcher every time a file is accessed through our web service. We know that it is rather expensive to open a new IndexSearcher, and we are thinking of using some kind of pooling mechanism.

Our questions are:

1. Is this one-index-per-file approach a viable solution? What do you think about pooling IndexSearchers?
2. If we have many IndexSearchers open at the same time, would the memory usage go through the roof? I couldn't find any documentation on how Lucene allocates memory.

Thank you very much for your help.

Many thanks,
Rui Wang
Re: Use multiple lucene indices
hi, below are some hints from my experience:

1. If you use one index per file and many IndexSearchers are open at the same time, you may hit a 'too many open files' error. You will have to increase the file-max value of the OS.
2. If these index files get little concurrent access, I think it is reasonable to open a new searcher for every access. Meanwhile, if you use the Lucene sort feature, the field cache may consume a lot of memory, so too many IndexSearchers open at the same time could exhaust all the memory of your machine.

--
gang liu
email: liuga...@gmail.com

At 2011-12-06 01:58:29, Rui Wang rw...@ebi.ac.uk wrote:
> Is this one index per file approach a viable solution? What do you think about pooling IndexSearcher? If we have many IndexSearchers opened at the same time, would the memory usage go through the roof? [...]
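The pooling idea raised in the question can be sketched without any Lucene classes: a small LRU cache keyed by index path, where the values would be IndexSearcher instances and the eviction hook is where the evicted searcher would be closed. The class name, capacity, and eviction callback are illustrative assumptions, not an established Lucene API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU pool: keeps at most `capacity` entries, evicting the
// least recently used one. For the one-index-per-file scenario, the
// key is the index path and V would be IndexSearcher.
public class SearcherPool<V> {
    private final int capacity;
    private final LinkedHashMap<String, V> cache;

    public SearcherPool(int capacity) {
        this.capacity = capacity;
        // accessOrder=true makes iteration order = least recently used first
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                boolean evict = size() > SearcherPool.this.capacity;
                if (evict) {
                    onEvict(eldest.getValue()); // e.g. searcher.close()
                }
                return evict;
            }
        };
    }

    // Hook for releasing resources held by an evicted entry.
    protected void onEvict(V value) { }

    public synchronized V get(String indexPath) { return cache.get(indexPath); }

    public synchronized void put(String indexPath, V searcher) { cache.put(indexPath, searcher); }

    public synchronized int size() { return cache.size(); }
}
```

This caps both open file handles and searcher memory at a predictable number; the right capacity depends on how much heap each searcher's field caches consume.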
SpanNearQuery and matching spans inside the first span
Supposing I have a document with just "hi there" as the text. If I do a span query like this:

near(near(term('hi'), term('there'), slop=0, forwards), term('hi'), slop=1, any-direction)

that returns no hits. However, if I do a span query like this:

near(near(term('hi'), term('there'), slop=0, forwards), term('there'), slop=1, any-direction)

that returns the document. It seems that the rule is that if two spans *start* at the same position, then they are not considered near each other. But from the POV of a user (and of this developer) this is lop-sided, because in both situations the second span was inside the first span. It seems like they should either both be considered hits, or both be considered non-hits. I am wondering what others think about this and whether there is any way to manipulate/rewrite the query to get a more balanced-looking result. (I'm sure it gets particularly hairy, though, when your two spans overlap only partially... is that near or not?)

TX
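For reference, the two queries in the example could be built like this with the Lucene span API (3.x; the field name "f" is an arbitrary placeholder):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanExample {
    static final String F = "f";

    public static SpanQuery[] build() {
        SpanQuery hi = new SpanTermQuery(new Term(F, "hi"));
        SpanQuery there = new SpanTermQuery(new Term(F, "there"));
        // inner: "hi" immediately followed by "there" (slop 0, in order)
        SpanQuery inner = new SpanNearQuery(new SpanQuery[]{hi, there}, 0, true);
        // inner near "hi", slop 1, any order -- reported above as no hit
        SpanQuery q1 = new SpanNearQuery(new SpanQuery[]{inner, hi}, 1, false);
        // inner near "there", slop 1, any order -- reported above as a hit
        SpanQuery q2 = new SpanNearQuery(new SpanQuery[]{inner, there}, 1, false);
        return new SpanQuery[]{q1, q2};
    }
}
```

The asymmetry described in the post comes from how the two clauses' start/end positions are compared, which this construction makes easy to experiment with.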
Re: Lucene index inside of a web app?
Hi

Check http://tomcat.apache.org. 80% of web containers follow the same strategy, and web.xml is well explained at that URL. By the way, which web container do u use?

with regards
karthik

On Fri, Dec 2, 2011 at 7:54 PM, okayndc bodymo...@gmail.com wrote:
What would the web.xml look like? I'm lost.

On Thu, Dec 1, 2011 at 11:04 PM, KARTHIK SHIVAKUMAR nskarthi...@gmail.com wrote:
Hi

Generated Lucene index: what if u need to upgrade this with more docs? The best approach is to inject the real path of the index (c:/temp/Indexes) into the web server application via web.xml. By this approach u can even achieve:
1) Load balancing of multiple web servers pointing to the same index files
2) Update/delete/re-index without the web application being interrupted

with regards
Karthik

On Tue, Nov 29, 2011 at 12:25 AM, okayndc bodymo...@gmail.com wrote:
Awesome. Thanks guys!

On Mon, Nov 28, 2011 at 12:19 PM, Uwe Schindler u...@thetaphi.de wrote:
You can store the index in the WEB-INF directory, just use something like: ServletContext.getRealPath("/WEB-INF/data/myIndexName");

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Ian Lea [mailto:ian@gmail.com]
Sent: Monday, November 28, 2011 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene index inside of a web app?

Using a static string is fine - it just wasn't clear from your original post what it was. I usually use a full path read from a properties file so that I can change it without a recompile, have different settings on test/live/whatever systems, etc. Works for me, but isn't the only way to do it. If you know where your app lives, you could use a full path pointing to somewhere within that tree, or you could use a partial path that the app server will interpret relative to something. Which is fine too - take your pick of whatever works for you.

--
Ian.

On Mon, Nov 28, 2011 at 4:40 PM, okayndc bodymo...@gmail.com wrote:
Hi, thanks for your response. Yes, LUCENE_INDEX_DIRECTORY is a static string which contains the file system path of the index (for example, c:\\index). Is this good practice? If not, what should the full path to an index look like? Thanks

On Mon, Nov 28, 2011 at 4:54 AM, Ian Lea ian@gmail.com wrote:
What is LUCENE_INDEX_DIRECTORY? Some static string in your app? Lucene knows nothing about your app, JSP, or what app server you are using. It requires a file system path and it is up to you to provide that. I always use a full path since I prefer to store indexes outside the app and it avoids complications with what the app server considers the default directory. But if you want to store it inside, without specifying a full path, look at the docs for your app server.

--
Ian.

On Sun, Nov 27, 2011 at 2:10 AM, okayndc bodymo...@gmail.com wrote:
Hello, I want to store the generated Lucene index inside of my Java application, preferably within a folder where my JSP files are located. I also want to be able to search from the index within the web app. I've been using the LUCENE_INDEX_DIRECTORY but this is on a file system (currently my hard drive). Should I continue to use LUCENE_INDEX_DIRECTORY if I want the Lucene index inside the app, or use something else? I was a bit confused about this. Btw, the Lucene index content comes from a database. Any help is appreciated.

--
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*
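Ian's properties-file suggestion can be sketched as follows; the key name `lucene.index.dir`, the class name, and the default path are illustrative assumptions, not anything from the thread:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

public class IndexConfig {
    // Resolve the index directory from loaded properties, falling back to
    // a default, so test/live systems can differ without a recompile.
    public static String indexPath(Properties props) {
        return props.getProperty("lucene.index.dir", "/var/data/lucene-index");
    }

    // Load properties from any reader (in a webapp this would typically
    // wrap a file or a resource stream read once at startup).
    public static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }
}
```

At startup the webapp would load the file once and hand the resulting path to FSDirectory, keeping the index location out of the compiled code entirely.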
Re: lucene-core-3.3.0 not optimizing
Hi

LUCENE-3454 http://issues.apache.org/jira/browse/LUCENE-3454: so u mean the code has changed with this API... Does anybody have any sample code snippet, or is there a sample to play around with?

with regards
karthik

On Fri, Dec 2, 2011 at 3:44 PM, Ian Lea ian@gmail.com wrote:
Well, calling optimize(maxNumSegments) will (from the javadocs on recent releases) "Optimize the index down to <= maxNumSegments". So optimize(100) won't get you down to 1 big file, unless you are using compound files perhaps. Maybe it did something different 7 years ago but that seems very unlikely. In 3.5.0 all optimize() calls are deprecated anyway. I suggest you read the release notes and the javadocs, upgrade to 3.5.0 and remove all optimize() calls altogether.

--
Ian.

On Fri, Dec 2, 2011 at 9:58 AM, KARTHIK SHIVAKUMAR nskarthi...@gmail.com wrote:
Hi

I used Lucene 1.x to index and optimize 5+ million XML docs 7 years ago, and that IndexWriter.optimize used to merge all the bits and pieces of the created index into 1 big file. I have not tracked the API changes in the 7 years since, and with lucene-core-3.3.0 I am not able to find on Google why this is happening.

with regards
karthik

On Fri, Dec 2, 2011 at 12:37 PM, Simon Willnauer simon.willna...@googlemail.com wrote:
What do you understand when you say optimize? Unless you tell us what this code does in your case and what you'd expect it to do, it's impossible to give you any reasonable answer.

simon

On Fri, Dec 2, 2011 at 4:54 AM, KARTHIK SHIVAKUMAR nskarthi...@gmail.com wrote:
Hi

Spec:
O/s: Win 7
Jdk: 1.6.0_29
Lucene: lucene-core-3.3.0

Finally, after indexing successfully, why does this code not optimize (sample code)?

INDEX_WRITER.optimize(100);
INDEX_WRITER.commit();
INDEX_WRITER.close();

--
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*
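Following Ian's advice to upgrade, a hedged sketch of the 3.5.0 replacement for optimize(): forceMerge(1) merges the whole index down to a single segment. The index path and analyzer choice here are illustrative:

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ForceMergeExample {
    public static void mergeToOneSegment(File indexDir) throws IOException {
        Directory dir = FSDirectory.open(indexDir);
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_35,
                new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(dir, conf);
        try {
            // forceMerge(1) is the 3.5 replacement for optimize():
            // merge the index down to a single segment.
            writer.forceMerge(1);
            writer.commit();
        } finally {
            writer.close();
        }
    }
}
```

Note that, unlike the old optimize(100) in the question, an argument of 1 is what actually yields "1 big file"; larger arguments only bound the segment count from above.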
Re: [JOB] Lucid Imagination is hiring
Hi

Too bad, during a recession. Am from INDIA ;(

with regards
karthik

On Mon, Dec 5, 2011 at 9:10 PM, Grant Ingersoll gsing...@apache.org wrote:
> Hi All, If you've wanted a full time job working on Lucene or Solr, we have two positions open that just might be of interest. [...]

--
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*
Re: Use multiple lucene indices
hi

> would the memory usage go through the roof?

Yup. My past experience got me pickles in there...

with regards
karthik

On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang rw...@ebi.ac.uk wrote:
> If we have many IndexSearchers opened at the same time, would the memory usage go through the roof? I couldn't find any document on how Lucene allocates memory. [...]

--
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*
Lucene bangalore chapter
is there a lucene Bangalore chapter ?

-Vinaya