RE: [ANNOUNCE] Web Crawler

2013-07-15 Thread karl.wright
Usually, if a webmaster finds that your crawler has ignored their robots.txt, they will block your machine, or maybe even your entire IP block, from accessing their site. Karl -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Monday, July 15, 2013 9:30 AM To
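
As an illustration of the point above, a crawler can check each URL against the site's robots.txt rules before fetching it. The following is a minimal sketch using Python's standard urllib.robotparser module; the user agent name "example-crawler", the example.com URLs, and the sample rules are assumptions made up for this example and do not come from the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)

# "example-crawler" is a hypothetical user agent name.
for url in ("http://example.com/index.html",
            "http://example.com/private/page.html"):
    verdict = "fetch" if allowed(parser, "example-crawler", url) else "skip"
    print(verdict, url)
```

A real crawler would download robots.txt from each host (RobotFileParser also offers set_url and read for that) and cache the parsed rules per host rather than inlining them.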

Jar packaging issue

2013-02-04 Thread karl.wright
Hello anyone, We recently ran into something people might not be fully aware of. Specifically, because codec jars require META-INF/services files in order to be discovered, and each codec has the same files, it's not a straightforward operation to glom all the Lucene jars of interest into one
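
The snippet above is cut off before it describes a remedy, so the following is only a sketch of one common approach, not necessarily what the original message recommends: when combining jars, concatenate the entries of same-named META-INF/services files instead of letting one copy overwrite another. The service name com.example.spi.Codec and the implementation class names are hypothetical placeholders, and the jars are built in memory so the example is self-contained.

```python
import io
import zipfile

SERVICES_PREFIX = "META-INF/services/"


def merge_service_files(jars):
    """Collect provider entries from every jar, merged per service file name.

    `jars` is an iterable of paths or file-like objects readable as zip files.
    Returns {service file name: [provider class names, first-seen order]}.
    """
    merged = {}
    for jar in jars:
        with zipfile.ZipFile(jar) as zf:
            for name in zf.namelist():
                # Skip non-service entries and directory entries.
                if not name.startswith(SERVICES_PREFIX) or name.endswith("/"):
                    continue
                bucket = merged.setdefault(name, [])
                for line in zf.read(name).decode("utf-8").splitlines():
                    # Service files allow '#' comments and blank lines.
                    entry = line.split("#", 1)[0].strip()
                    if entry and entry not in bucket:
                        bucket.append(entry)
    return merged


def make_jar(files):
    """Build an in-memory jar (zip) from {entry name: text}, for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    buf.seek(0)
    return buf


# Hypothetical service and provider names, used for illustration only.
SERVICE = SERVICES_PREFIX + "com.example.spi.Codec"
jar_a = make_jar({SERVICE: "com.example.CodecA\n"})
jar_b = make_jar({SERVICE: "# second jar\ncom.example.CodecB\n"})

merged = merge_service_files([jar_a, jar_b])
for name, providers in merged.items():
    print(name, providers)
```

Writing each merged list back into the combined jar as a single file per service name keeps every provider discoverable. Build tools often ship a transformer for this, so a hand-rolled script like this one is mainly useful for understanding the problem.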

RE: is it possible to index wiki markup files?

2012-01-11 Thread karl.wright
You might be interested in looking at ManifoldCF for getting your documents into Solr. See http://incubator.apache.org/connectors for more details. Karl -Original Message- From: ext Reyna Melara [mailto:reynamel...@gmail.com] Sent: Wednesday, January 11, 2012 2:13 PM To: java-user@luc

RE: what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread karl.wright
It's also worth looking at ManifoldCF. Karl -Original Message- From: ext Markus Jelsma Sent: 23/08/2011, 6:24 AM To: solr-u...@lucene.apache.org Cc: java-user@lucene.apache.org Subject: Re: what's the status of droids project(http://incubator.apache.org/droids/)? You should ask on the

RE: [Help Wanted] Graphics and other help for new Lucene/Solr website

2011-08-10 Thread karl.wright
The site looks great. And thank you for including the ManifoldCF link. ;-) Karl -Original Message- From: ext Grant Ingersoll [mailto:gsing...@apache.org] Sent: Wednesday, August 10, 2011 10:09 AM To: solr-u...@lucene.apache.org; java-user@lucene.apache.org Subject: [Help Wanted] Graphic

RE: need help

2011-06-21 Thread karl.wright
You might want to look at ManifoldCF too. http://incubator.apache.org/connectors/ Karl -Original Message- From: ext Marlen [mailto:zmach...@facinf.uho.edu.cu] Sent: Tuesday, June 21, 2011 9:49 AM To: java-user@lucene.apache.org Subject: need help I need to create a search engine that s

RE: [ANNOUNCE] Web Crawler

2011-05-15 Thread karl.wright
You might want to look at ManifoldCF also. Karl -Original Message- From: ext abhayd [mailto:ajdabhol...@hotmail.com] Sent: Saturday, May 14, 2011 9:29 AM To: java-user@lucene.apache.org Subject: Re: [ANNOUNCE] Web Crawler hi Dominique, I am looking for a crawler to feed solr index. Aft

RE: how to get all documents in the results ?

2011-03-22 Thread karl.wright
Not sure what your use case actually is, but it sounds like you may be unclear on how Lucene works. Each query clause you have will produce an iterator that walks over the documents that match that clause. All the documents from the entire root query get scored. The scoring evaluation per docum
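
The idea of per-clause iterators can be pictured with a small conceptual sketch. This is not Lucene's API: it assumes each clause yields matching document IDs in increasing order and that the root query is a conjunction (AND) of two clauses, and the function name and sample IDs are made up for illustration.

```python
def intersect(clause_a, clause_b):
    """Yield doc IDs present in both inputs, which must be sorted ascending.

    Each input stands in for one query clause's iterator; the output is
    what a two-clause AND query would match.
    """
    it_a, it_b = iter(clause_a), iter(clause_b)
    try:
        a, b = next(it_a), next(it_b)
        while True:
            if a == b:
                yield a  # both clauses match this document
                a, b = next(it_a), next(it_b)
            elif a < b:
                a = next(it_a)  # advance the iterator that is behind
            else:
                b = next(it_b)
    except StopIteration:
        # One clause is exhausted, so no further documents can match both.
        return


# Hypothetical doc IDs matched by each clause.
title_matches = [1, 3, 5, 8]
body_matches = [3, 4, 5, 9]

all_matches = list(intersect(title_matches, body_matches))
print(all_matches)
```

Every ID that comes out of the combined iterator is a match for the whole query, which is the set of documents that would then be scored.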

RE: ManifoldCF in Action

2011-03-10 Thread karl.wright
Ah, I was not thinking of a Solr addon! I thought you were referring to some other crawler that I'd never heard of. So the answer to your question is that ManifoldCF differs from DIH in at least the following ways: - ManifoldCF can handle a wide range of repositories, not just database tables

Re: ManifoldCF in Action

2011-03-10 Thread karl.wright
>> Karl, can you give, in one paragraph, the difference between ManifoldCF and DIH? thanks in advance paul << I am unfamiliar with DIH as an acronym in either the content management or crawling infrastructure space. Can you clarify what you mean? Karl

ManifoldCF in Action

2011-03-01 Thread karl.wright
Dear Lucene/Solr user, It is possible you may not know of an Apache project called ManifoldCF, whose purpose is to provide content to Solr for indexing. If you have interest in this project, this is to inform you that the ManifoldCF book from Manning Publications, titled ManifoldCF in Action, is no