Re: Re: UNIX command-line indexing script?
Charlie, I wrote this in Java. Of course I am ready to share it, but I still have some problems when indexing large volumes of data, and it is under testing.

Linto

On Fri, 26 Mar 2004 Charlie Smith wrote:
> So, Linto,
> Did you write this in PERL or JAVA? Would you be willing to part with a copy of the source?

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: UNIX command-line indexing script?
So, Linto,

Did you write this in PERL or JAVA? Would you be willing to part with a copy of the source?

> Linto wrote on 3/16/04:
> I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and plain-text files. [...]
Re: UNIX command-line indexing script?
I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and plain-text files. I based it on the demo application and used other open source components: POI by Apache (for DOC and Excel) and PDFBox. I also modified the client interface, so now it looks like Google. I still have a couple of things to do.

1) At present I'm using the UNIX 'file' command to check whether a file is plain text. This spawns a process and takes more time. The advantage is on UNIX-based machines, where the file extension is not important (it uses magic numbers).
2) Information such as Index Location, Directory, URL, etc. should be kept in an XML file so that it can be dynamic.
3) Category

Since the Apache guys provided a good framework, everything was made easy. Thanks guys!

Linto

On Sat, 13 Mar 2004 Charlie Smith wrote:
> Anyone written a simple UNIX command-line indexing script which will read a
> bunch of different kinds of docs and index them? I'd like to make a cron job
> out of this so as to be able to come back and read it later during a search.
>
> PERL or JAVA script would be fine.
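On point 1 above: one way to avoid spawning 'file' for every document is to sniff the first bytes of the file in-process. The following is only a minimal sketch, not Linto's code. The class name TextSniffer, the 512-byte sample size and the 5% control-byte threshold are all assumptions chosen for illustration, and the heuristic is far cruder than the magic-number database that 'file' consults.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: guesses the file type from its leading bytes. */
public class TextSniffer {
    // Assumption: 512 bytes is enough to make a reasonable guess.
    private static final int SAMPLE_SIZE = 512;

    /** True if the first `length` bytes hold no NUL byte and at most 5% control bytes. */
    public static boolean looksLikeText(byte[] sample, int length) {
        if (length == 0) {
            return true; // an empty file is treated as (empty) text
        }
        int suspicious = 0;
        for (int i = 0; i < length; i++) {
            int b = sample[i] & 0xFF;
            if (b == 0) {
                return false; // NUL bytes are a strong sign of binary data
            }
            boolean whitespace = b == '\t' || b == '\n' || b == '\r' || b == '\f';
            if (b < 0x20 && !whitespace) {
                suspicious++;
            }
        }
        return suspicious * 20 <= length;
    }

    /** True if the sample starts with the PDF signature "%PDF-". */
    public static boolean startsWithPdfMagic(byte[] sample, int length) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (sample[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to SAMPLE_SIZE bytes from the file and applies the text heuristic. */
    public static boolean looksLikeText(Path file) {
        byte[] buf = new byte[SAMPLE_SIZE];
        try (InputStream in = Files.newInputStream(file)) {
            int n = 0;
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break; // end of file reached before the buffer filled
                }
                n += r;
            }
            return looksLikeText(buf, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An indexer would probably check known signatures such as the PDF one first and fall back to the text heuristic only for files that match none of them, since the opening bytes of a binary format can look like text.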
Re: UNIX command-line indexing script?
Erik and I are putting the finishing touches on it, so by Summer (this one ;)).

Otis

--- Charlie Smith <[EMAIL PROTECTED]> wrote:
> So, how upcoming is this book going to be?
Re: UNIX command-line indexing script?
So, how upcoming is this book going to be?

>>> [EMAIL PROTECTED] 3/15/2004 3:39:39 AM >>>
To add to this.
The upcoming Lucene in Action book has ready-to-use code that will handle and index files in most popular file formats.

Otis
Re: UNIX command-line indexing script?
To add to this.
The upcoming Lucene in Action book has ready-to-use code that will handle and index files in most popular file formats.

Otis

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Have a look at the Ant task in the Lucene sandbox. [...]
Re: UNIX command-line indexing script?
Have a look at the Ant task in the Lucene sandbox. You're on your own, currently, to build this and understand it, but I use it frequently. In fact, the sample index from our book is generated with this:

    [...] documenthandler="lia.common.TestDataDocumentHandler"> [...]

You can plug in your own DocumentHandler implementation to index different document types however you like. The default one indexes .txt and .html files, but a custom implementation can do its own thing. Again, writing a DocumentHandler that knows about various document types is not hard, but you will have to write your own at the moment.

Despite the (minor) amount of work you'll have to do to start using it, the infrastructure adds a lot of value: an incremental file system indexer (only new docs get indexed on successive runs). Plugging this into cron would be trivial.

Erik

On Mar 13, 2004, at 11:45 AM, Charlie Smith wrote:
> Anyone written a simple UNIX command-line indexing script which will read a
> bunch of different kinds of docs and index them? I'd like to make a cron job
> out of this so as to be able to come back and read it later during a search.
>
> PERL or JAVA script would be fine.
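To make the two ideas in Erik's message concrete (pluggable per-type handlers, and indexing only what is new on each run), here is a rough sketch in plain Java. It is not the sandbox Ant task and does not use its DocumentHandler interface. The names IncrementalIndexSketch, TextExtractor, register, extractorFor and selectForIndexing are hypothetical, and the in-memory timestamp map merely stands in for whatever state a real indexer would persist between runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: extension-based handlers plus "only new or changed files" selection. */
public class IncrementalIndexSketch {

    /** Hypothetical handler: turns raw file content into indexable text. */
    public interface TextExtractor {
        String extract(String rawContent);
    }

    // Handlers keyed by lower-case file extension (assumption: extension decides the type).
    private final Map<String, TextExtractor> extractors = new HashMap<>();
    // Last-modified time recorded when each file was last selected for indexing.
    private final Map<String, Long> indexedAt = new HashMap<>();

    public void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the handler for a file name, or null if no handler is registered. */
    public TextExtractor extractorFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return extractors.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /**
     * Given file name -> last-modified time, returns (in name order) the files that
     * have a handler and are new or modified since the previous call, and records them.
     */
    public List<String> selectForIndexing(Map<String, Long> lastModifiedByName) {
        List<String> todo = new ArrayList<>();
        for (Map.Entry<String, Long> e : new TreeMap<>(lastModifiedByName).entrySet()) {
            if (extractorFor(e.getKey()) == null) {
                continue; // unknown document type: skip it
            }
            Long seen = indexedAt.get(e.getKey());
            if (seen == null || seen < e.getValue()) {
                todo.add(e.getKey());
                indexedAt.put(e.getKey(), e.getValue());
            }
        }
        return todo;
    }
}
```

A second run over an unchanged directory selects nothing, which is the property that makes a periodic cron run cheap. A real indexer would also have to handle deleted files and keep its state across process restarts, both of which this sketch leaves out.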
UNIX command-line indexing script?
Has anyone written a simple UNIX command-line indexing script which will read a bunch of different kinds of docs and index them? I'd like to make a cron job out of this so as to be able to come back and read it later during a search. A PERL or JAVA script would be fine.