shrinath.m shrinat...@webyog.com wrote:
Suppose we have offline HTML pages, with no parsing done while crawling. Now what?
Has anyone built a tokenizer for this?
In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
by selecting only text between certain tags, before indexing
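The select-only-text-between-certain-tags preprocessing Bill describes can be sketched with Python's stdlib html.parser (he actually uses BeautifulSoup; the tag whitelist here is purely illustrative):

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect only the text inside a whitelisted set of tags,
    in the spirit of the select-text-between-certain-tags approach."""
    KEEP = {"p", "h1", "h2", "h3", "title", "li"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside kept tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside a whitelisted tag.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TagTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Scripts, navigation divs, and other chrome fall outside the whitelist and are dropped before the text ever reaches the indexer.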
James Wilson james_wil...@nmcourt.fed.us wrote:
I have completed a project to do the exact same thing. I put the pdf
text in XML files. Then after I do a Lucene search I read the text from
the XML files. I do not store the text in the Lucene index. That would
bloat the index and slow down
languages.
Bill
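James's approach of keeping the full text outside the index, in XML sidecar files read back after a search, might look roughly like this sketch (the file layout and element names are hypothetical):

```python
import os
import xml.etree.ElementTree as ET

def write_sidecar(doc_id, pages, directory):
    """Write extracted text to an XML file keyed by doc_id, so the
    search index only needs to store the id, not the full text."""
    root = ET.Element("document", id=doc_id)
    for i, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(i))
        page.text = text
    path = os.path.join(directory, doc_id + ".xml")
    ET.ElementTree(root).write(path, encoding="utf-8")
    return path

def read_sidecar(doc_id, directory):
    """After a search returns doc_id, load the display text from XML."""
    tree = ET.parse(os.path.join(directory, doc_id + ".xml"))
    return [p.text for p in tree.findall("page")]
```

The index stays small because it holds only indexed terms plus the id; retrieval cost moves to one extra file read per hit.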
With this solution:
1. I only need one field (or two if I want both stemmed and unstemmed
processing)
2. The user can search in all documents regardless of their language
I hope this helps.
Dominique
www.zoonix.fr
www.crawl-anywhere.com
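Dominique's stemmed-plus-unstemmed field pair could be sketched like this; the toy suffix stripper is a stand-in for a real stemmer:

```python
def toy_stem(word):
    # Very naive suffix stripping -- a real system would plug in a
    # proper (possibly language-neutral) stemmer here.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def make_index_fields(text):
    """Build the two fields described above: raw tokens and their
    stemmed counterparts, so one query can hit either form."""
    tokens = [t.lower() for t in text.split()]
    return {
        "body": tokens,
        "body_stemmed": [toy_stem(t) for t in tokens],
    }
```

Querying both fields lets exact matches outrank stem-only matches while still finding them, independent of document language.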
On 20/01/11 00:29, Bill Janssen
Clemens Wyss clemens...@mysign.ch wrote:
1) Docs in different languages -- every document is one language
2) Each document has fields in different languages
We mainly have models of type 1).
I've recently done this for UpLib. I run a language-guesser over the
document to identify the primary
reasonable corpus to be convinced it
would be worth it.
Bill
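The language-guesser step can be illustrated with a much cruder stopword-overlap sketch; real identifiers like textcat (mentioned elsewhere in the thread) use character n-gram profiles trained on a corpus instead:

```python
# Tiny function-word profiles -- illustrative only.
PROFILES = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}

def guess_language(text):
    """Guess the primary language by counting hits against each
    profile; highest overlap wins."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    scores = {
        lang: sum(1 for w in words if w in stopwords)
        for lang, stopwords in PROFILES.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to English when nothing matches at all.
    return best if scores[best] > 0 else "en"
```

Once the primary language is known, the document can be routed to a language-appropriate analyzer before indexing.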
paul
On 19 Jan 2011 at 19:21, Bill Janssen wrote:
Clemens Wyss clemens...@mysign.ch wrote:
1) Docs in different languages -- every document is one language
2) Each document has fields in different languages
We mainly
Paul Libbrecht p...@hoplahup.net wrote:
I made several changes of this sort, and the precision and recall
measures improved, in particular in the presence of language-indication
failures, which happened to be very common in our authoring environment.
There are two kinds of failures: no language,
Grant Ingersoll gsing...@apache.org wrote:
Where do you get your Lucene/Solr downloads from?
[x] ASF Mirrors (linked in our release announcements or via the Lucene
website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[x] I/we build them from source via an
Ching zchin...@gmail.com wrote:
I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
which tools you use to extract text from PDF? Thanks.
Ching, in UpLib I use a patched version of xpdf which reports the
bounding box and font information for each word (as well as the
Robert Muir rcm...@gmail.com wrote:
On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen jans...@parc.com wrote:
I thought that since I'm updating UpLib's Lucene code, I should tackle
the issue of document languages, as well. Right now I'm using an
off-the-shelf language identifier, textcat
I thought that since I'm updating UpLib's Lucene code, I should tackle
the issue of document languages, as well. Right now I'm using an
off-the-shelf language identifier, textcat, to figure out which language
a Web page or PDF is (mainly) written in. I then want to analyze that
document with an
Simon Willnauer simon.willna...@googlemail.com wrote:
On Fri, Sep 17, 2010 at 11:45 PM, Bill Janssen jans...@parc.com wrote:
Simon Willnauer simon.willna...@googlemail.com wrote:
On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen jans...@parc.com wrote:
Simon Willnauer simon.willna
Simon Willnauer simon.willna...@googlemail.com wrote:
Hey Bill,
let me clarify what Version is used for, since I think that caused a
little confusion.
Thanks.
The Version constant was mainly introduced to help
users with backwards compatibility and upgrading their codebase to a
new version
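The backwards-compatibility role Simon describes can be illustrated with a language-neutral sketch; the constants and the behavior change here are hypothetical, not Lucene's actual analyzers:

```python
from enum import IntEnum

class Version(IntEnum):
    # Hypothetical constants mirroring the idea of
    # org.apache.lucene.util.Version, not its real values.
    V_24 = 24
    V_30 = 30

def analyze(text, version):
    """Tokenizer whose behavior is frozen per version: callers that
    keep passing V_24 keep the old behavior even after upgrading."""
    tokens = text.split()
    if version >= Version.V_30:
        # Pretend a newer release changed tokenization; the version
        # constant lets old indexes keep being analyzed the old way.
        tokens = [t.lower() for t in tokens]
    return tokens
```

The point of the pattern is that upgrading the library jar never silently changes analysis for code that pins an older constant.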
Simon Willnauer simon.willna...@googlemail.com wrote:
On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen jans...@parc.com wrote:
Simon Willnauer simon.willna...@googlemail.com wrote:
Hey Bill,
let me clarify what Version is used for, since I think that caused a
little confusion.
Thanks
Bill Janssen jans...@parc.com wrote:
...is there any attribute or static
method somewhere in Lucene which will return a value of
org.apache.lucene.util.Version that corresponds to the version of the
code? That's what I'm looking for. Version.LUCENE_CURRENT looks good,
but it's deprecated
So, in version 3, I have to pass a version parameter to the constructor
for StandardAnalyzer. Since Version.LUCENE_CURRENT is deprecated, I'd
like this to be the same as the version of the index I'm using. Is
there a way of getting a value of Version for an index? I don't see
obvious methods on
I do this with uplib (http://uplib.parc.com/) with fair success.
Originally I thought I'd need Lucene plus a relational database to
store metadata about the documents for metadata searches. So far,
though, I've been able to store the metadata in Lucene and use the
same Lucene DB for both metadata
The problem I am having is that some of them have multiple columns and
multiple word boxes. Does the xpdf patch extract different columns and word boxes?
It tells you where each word is. Columns you have to do for yourself.
Bill
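Doing columns yourself from per-word bounding boxes might look like this sketch; the box tuple format and the gap threshold are assumptions, not what the xpdf patch actually emits:

```python
def split_columns(word_boxes, min_gap=50):
    """word_boxes: list of (x0, y0, x1, y1, word) tuples, as a
    bounding-box reporter might emit them.  Group words into columns
    by looking for horizontal gaps wider than min_gap."""
    if not word_boxes:
        return []
    boxes = sorted(word_boxes, key=lambda b: b[0])
    columns = [[boxes[0]]]
    right_edge = boxes[0][2]
    for box in boxes[1:]:
        if box[0] - right_edge > min_gap:
            columns.append([box])      # big gap: start a new column
        else:
            columns[-1].append(box)
        right_edge = max(right_edge, box[2])
    # Within each column, restore reading order (top to bottom).
    return [[b[4] for b in sorted(col, key=lambda b: b[1])]
            for col in columns]
```

Real layouts (nested columns, figures, rotated text) need more than a single gap threshold, but this captures the basic idea of recovering reading order from positions.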
In UpLib, I use xpdf-3.02pl2 with a patch which gives me position
The Unix program pdftotext can convert while keeping the text positions, but I
wanted to ask you guys if you know something better.
AFAIK, PDFBox has a lower-level API that allows you to get hold of text
positions.
In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
font
- Fetch and index some pages (containing Word and PDF documents) on a
daily basis.
- Extract all pages that contain some provided keywords after fetching
the pages.
- Create bulletins from fetched pages; bulletins will be in PDF
format and are categorized based on keywords.
- provide
I'll see if I can get back to this over the weekend.
I got a chance to copy my corpus to another G4 and try indexing with
Lucene 2.2. This one seems OK! Same texts. So now I'm inclined to
believe that it *is* the machine, rather than the code. Whew! Though
that doesn't explain why 2.0 works
Hmmm, it still sounds like you are hitting a threading issue that is
probably exacerbated by the multicore platform of the newer machine.
Exactly what I was thinking.
What are the details of the CPUs of these two systems?
Ah, good point. The bad machine is a dual-processor 1GHz G4 wind
Your errors seem to happen around the same area (~20K docs). If you
skip the first say ~18K docs does the error still happen? We need to
somehow narrow this down.
I'm trying to boil down the documents to a set which I can deploy on a
DVD-ROM, so I can move the same set around from machine to
Do you have another PPC machine to reproduce this on? (To rule out
bad RAM/hard-drive on the first one).
I'll dig up an old laptop and try it there.
Another thing to try is turning on the infoStream
(IndexWriter.setInfoStream(...)) and capture and post the resulting log.
It will be very large
Another thing to try is turning on the infoStream
(IndexWriter.setInfoStream(...)) and capture and post the resulting log.
It will be very large since it takes quite a while for the error to
occur...
I can do that.
Here's what I see:
Optimizing...
merging segments _ram_a (1 docs) _ram_b
Another thing to try is turning on the infoStream
(IndexWriter.setInfoStream(...)) and capture and post the resulting log.
It will be very large since it takes quite a while for the error to
occur...
I can do that.
Here's a more complete dump. I've modified the code so that I now
remove
Can you try running with the trunk version of Lucene (2.3-dev) and see
if the error still occurs? EG you can download this AM's build here:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/288/artifact/artifacts
Still there. Here's the dump with last night's build:
Are you still getting the original exception too, or just the Array out
of bounds one now? Also, are you doing anything else to the index
while this is happening? The code at the point in the exception below
is trying to properly handle deleted documents.
Just the
Could you post this part of the code (deleting) too?
Here it is:
private static void remove (File index_file, String[] doc_ids, int start) {
String number;
String list;
Term term;
TermDocs matches;
if (debug_mode)
Have you tried another PPC machine?
No. It's in another location, but perhaps I can get it tomorrow. On
the other hand, the success when using 2.0 makes it likely to me that
the machine isn't the problem.
OK, I've reverted to my original codebase (where I first create a
reader and do the
Also, could you try out the CheckIndex tool in 2.3-dev before and
after the deletes?
Great idea! I don't suppose there's a jar file of it?
Bill
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
So, it's a little clearer. I get the Array-out-of-bounds exception if
I'm re-indexing some already indexed documents -- if there are
deletions involved. I get the CorruptIndexException if I'm indexing
freshly -- no deletions. Here's an example of that (with the latest
nightly). I removed the
Here's the code I'm using:
try {
// Now add the documents to the index
IndexWriter writer = new IndexWriter(index_loc, new
StandardAnalyzer(), !index_loc.exists());
writer.setMaxFieldLength(Integer.MAX_VALUE);
try {
for
I just tried re-indexing with lucene-core-2.0.0.jar and the same
indexing code; works great. So what am I doing wrong with 2.2?
Bill
Are you really sure in your 2.2 test you are starting with no prior
index?
I'd ask that too, but yes, I'm really really sure. Building a
completely new index each time.
Works with 2.0.0. Fails with 2.2.0. Works with 2.2.0 *if* I remove
the optimization step.
Bill
You are not hitting any other exception before this one right?
Can you change your test case so that the catch clause is run
before the finally clause? I wonder if you are hitting some
interesting exception and then trying to optimize, which then
masks the original exception.
Yes, I
I'm going to run the same software on an
Intel machine and see what happens.
So, I ran the same codebase with lucene-core-2.2.0.jar on an Intel Mac
Pro, OS X 10.5.0, Java 1.5, and no exception is raised. Different
corpus, about 5 pages instead of 2. This is reinforcing my
thinking
Hmmm ... how many chunks of about 50 pages do you do before hitting this?
Roughly how many docs are in the index when it happens?
Oh, gosh, not sure. I'm guessing it's about half done.
Can you describe the docs/fields you're adding?
I've got 1735 documents, 18969 pages -- average page size
Lucille apparently doesn't require gcj.
Bill
Why Lucille in light of PyLucene?
Erik
On Aug 28, 2007, at 10:55 AM, Dan Callaghan wrote:
Dear list,
I have recently begun a Python port of Lucene, named Lucille. It is
still very much a work in progress, but I hope to have a
Dawid Weiss wrote:
You could also try splitting the document into paragraphs and use Carrot2's
Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters.
Labelling routine in Lingo should extract 'key' phrases; this analysis is
heavily frequency-based, but... you know,
docfreqs (idfs) do not take into account deleted docs.
This is more of an engineering tradeoff than a feature.
If we could cheaply and easily update idfs when documents are deleted
from an index, we would.
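A worked example of what stale docFreq means, using the classic DefaultSimilarity-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the document counts are hypothetical:

```python
import math

def idf(num_docs, doc_freq):
    # The classic Lucene DefaultSimilarity-style formula.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index: 1000 documents, the term occurs in 9 of them.
before = idf(1000, 9)

# Now delete 5 of those 9 documents.  Deletions only mark docs;
# docFreq is not updated, so scoring keeps using the stale count:
stale = idf(1000, 9)           # identical to `before`

# maxDoc() - numDocs() tells you how many docs are marked deleted
# but still take up space in the index:
max_doc, num_docs = 1000, 995
deleted_but_present = max_doc - num_docs

# Once a merge/optimize rewrites the segments, the counts catch up:
after_merge = idf(995, 4)
```

Until segments holding the deletions are merged away, the stored statistics simply describe a slightly larger index than the one you can actually retrieve from.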
Wow. So is it fair to say that the stored IDF is really the
cumulative IDF for
The difference between IndexReader.maxDoc() and numDocs() tells you
how many documents have been marked for deletion but still take up
space in the index.
But not which terms have an odd IDF value because of those deleted
documents. How much does the IDF value contribute to the score in
Try looking at the retrieveInterestingTerms method on the class MoreLikeThis.
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
Bill
For example, given terms female, John and London - all 3 may
have equal IDF but should a document representing a female in London
be given equal weighting to a document representing the rarer example
of a female who happens to be called John?
Not to mention multi-word phrase tokenization,
MoreLikeThis is just what I wanted. Thanks.
Bill
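The idea behind MoreLikeThis's retrieveInterestingTerms — rank a document's terms by tf-idf and OR the best ones together as a query — can be sketched like this toy version (not the actual Lucene implementation):

```python
import math
from collections import Counter

def interesting_terms(doc, corpus, top_n=5):
    """Pick the doc's highest tf-idf terms and join them into an
    OR query, roughly what query-by-example systems do."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    n_docs = len(corpus)

    def df(term):
        # How many corpus docs contain the term at all.
        return sum(1 for d in corpus if term in d.lower().split())

    scored = {
        t: count * math.log((n_docs + 1) / (df(t) + 1))
        for t, count in tf.items()
    }
    top = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return " OR ".join(sorted(top))
```

Common words score near zero and drop out, so the generated query keeps only the terms that make the example document distinctive.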
Yes, I believe Dave did something like that on searchmorph.org and somebody
else did this on some site with RFCs. What's that called? Query by example?
I think so; try define:Query By Example on Google.
Take a look at MoreLikeThis
I was thinking of trying something, and wondered if someone else
already had it working...
I'd like to take a document, and use it as a query to find other
documents in my index that 'match' it. I'm talking about short
documents, like newspaper articles or email messages. Seems to me
that there
I'd like to suggest a minor change in the QueryParser.jj. I thought
I'd describe it here and get some feedback before posting a patch.
The issue is that I can't get my hands on some clauses (typically
PhraseQuery instances) in my subclass of MultiFieldQueryParser, which
I'd like to do to
Hi!
I've got oodles of email stored in MH (one file per message,
hierarchical directories) format. I'm looking for an IMAP server that
will use Lucene to index that mail and perform the various search
parts of the IMAP protocol. Ideally, the mail would not have to be
converted to another email
In case anyone else was wondering:
I got curious about how one would replace FieldCache, and discovered
that you can create an instance of a class which implements
FieldCache, and then simply assign it to
org.apache.lucene.search.FieldCache.DEFAULT.
2) your use case sounds like it could best be
Hi.
Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded
and there is no way to set it from outside?
I've seen a check-in in the CVS from a few days ago which added
getters/setters for this, but ... there is no release containing
this, right?
So, my question is:
I presume the patch that gives us a way of overriding the default
timeout for write locks has made it into the source DB, but I really
need a jar file to point people at which contains it. Any chance of
a 1.9.2 release?
Bill
Let's stop this thread.
Can I use Lucene to search the internet?
No.
You may be able to use Lucene to *index* the internet, and then search
the resulting index. Read the book Lucene in Action for a better idea
of what this would entail.
Bill
Daniel Naber ponders:
Seems these have been forgotten. They can easily be added, but I still
wonder what the use case is to set these values?
The default value isn't magic. The appropriate value is
context-specific. I've got some people using Lucene on machines with
slow disks, and we need
I don't see how to adjust the value of IndexWriter's
WRITE_LOCK_TIMEOUT in 1.9. Since the property
org.apache.lucene.writeLockTimeout is no longer consulted, the value
of IndexWriter.WRITE_LOCK_TIMEOUT is final, and there's no setter,
what's the deal?
Bill
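The timeout under discussion governs how long a writer keeps polling for the write lock before giving up. The pattern is roughly this sketch (a file-based poll loop in the spirit of write.lock, not Lucene's actual Lock code):

```python
import os
import time

class FileLock:
    """Polling file-based lock: try to create the lock file
    exclusively, retrying until a timeout expires."""
    def __init__(self, path, timeout=1.0, poll_interval=0.05):
        self.path = path
        self.timeout = timeout
        self.poll_interval = poll_interval

    def obtain(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # O_EXCL makes creation atomic: only one process wins.
                fd = os.open(self.path,
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return True
            except FileExistsError:
                if time.monotonic() >= deadline:
                    return False   # timed out, akin to a lock-obtain failure
                time.sleep(self.poll_interval)

    def release(self):
        os.remove(self.path)
```

On slow disks or busy indexes a longer timeout simply means more patience in that retry loop, which is exactly why a hardcoded default doesn't fit every deployment.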
I've got a daemon process which keeps an IndexSearcher open on an
index and responds to query requests by sending back document
identifiers. I've also got other processes updating the index by
re-indexing existing documents, deleting obsolete documents, and
adding new documents. Is there any way
I've got a some code developed for Lucene 1.4.1, that works around the
problem of having both (1) multiple default fields, and (2) the AND
operator for query elements. In 1.4.1, MultiFieldQueryParser
effectively only allowed the OR operator.
I'm wondering if this has changed in 1.9. Will I be
Thanks for pointing this out, Marvin. I wish Sun (or someone) would
document and register this particular character set encoding with
IANA, so that it could be used outside of Java. As it stands now,
it's essentially a bastard encoding, good for nothing, and one of the
warts of Java.
Lucene
I've got a situation where I'm searching over a number of different
repositories, each containing a different set of documents. I'd like
to run searches over, say, 4 different indices, then combine the
results outside of Java to present to the user. Is there any way of
normalizing search scores
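One common, admittedly crude, answer to this is to max-normalize each index's result list before interleaving, so every index's best hit scores 1.0. A sketch:

```python
def merge_normalized(result_lists, top_n=10):
    """result_lists: one [(doc_id, score), ...] list per index.
    Max-normalize each list, then merge by normalized score.
    (A heuristic: scores become only roughly comparable, since the
    indices still have different term statistics.)"""
    merged = []
    for results in result_lists:
        if not results:
            continue
        best = max(score for _, score in results)
        merged.extend((doc_id, score / best)
                      for doc_id, score in results)
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:top_n]
```

This prevents one index with systematically larger raw scores from dominating the merged list, though it cannot correct for genuinely different collection statistics.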