Re: Re: UNIX command-line indexing script?

2004-03-30 Thread Linto Joseph Mathew
charlie,

I wrote this in Java. Of course I am ready to share it, but I have some problems when
indexing large volumes of data, and it is still under testing.

Linto


 


On Fri, 26 Mar 2004 Charlie Smith wrote :
>So, Linto,
>
>  Did you write this in PERL or JAVA?  Would you be willing to part with a copy of
>the source?
>-
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>


Re: UNIX command-line indexing script?

2004-03-26 Thread Charlie Smith
So, Linto,

 Did you write this in PERL or JAVA?  Would you be willing to part with a copy of
the source?



>Linto wrote on 3/16/04
>
>I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and plain text
>files. I wrote this based on the demo application, using other open source
>components: POI by Apache (for DOC and Excel) and PDFBox. I also modified the
>client interface. Now it looks like Google. I still have to do a couple of things.
>  1) At present I'm using the UNIX 'file' command to check whether a file is plain
>     text. This will spawn a process and take more time. The advantage is that on
>     UNIX-based machines the file extension is not important (it uses magic numbers).
>  2) The information such as index location, directory, URL, etc. should be kept
>     in an XML file so that it can be dynamic.
>  3) Category
>
>Since the Apache guys provided a good framework, everything was made easy. Thanks
>guys!
>
>Linto







Re: UNIX command-line indexing script?

2004-03-16 Thread Linto Joseph Mathew

I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and plain text files. I
wrote this based on the demo application, using other open source components: POI by
Apache (for DOC and Excel) and PDFBox. I also modified the client interface. Now it
looks like Google. I still have to do a couple of things.
  1) At present I'm using the UNIX 'file' command to check whether a file is plain text.
     This will spawn a process and take more time. The advantage is that on UNIX-based
     machines the file extension is not important (it uses magic numbers).
  2) The information such as index location, directory, URL, etc. should be kept
     in an XML file so that it can be dynamic.
  3) Category

Since the Apache guys provided a good framework, everything was made easy. Thanks guys!


Linto
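
One way to avoid spawning a process for the plain-text check is to sample the start of each file in-process. The following is a minimal Java sketch, not Linto's code: the class name PlainTextSniffer, the 4096-byte sample size and the 5% control-character threshold are all assumptions, and the heuristic is much cruder than the magic-number database behind 'file' (for example, it treats most UTF-16 text as binary because of its NUL bytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: guesses whether a file is plain text without spawning 'file'. */
final class PlainTextSniffer {
    private static final int SAMPLE_SIZE = 4096;            // assumed sample size
    private static final double MAX_CONTROL_RATIO = 0.05;   // assumed threshold

    /** Returns true if the first {@code length} bytes of {@code sample} look like text. */
    static boolean looksLikeText(byte[] sample, int length) {
        if (length == 0) {
            return true; // treat an empty file as text
        }
        int control = 0;
        for (int i = 0; i < length; i++) {
            int b = sample[i] & 0xFF;
            if (b == 0) {
                return false; // a NUL byte is taken as a sign of binary data
            }
            boolean whitespace = b == '\t' || b == '\n' || b == '\r' || b == '\f';
            if (b < 0x20 && !whitespace) {
                control++; // count control characters other than common whitespace
            }
        }
        return (double) control / length <= MAX_CONTROL_RATIO;
    }

    /** Reads up to SAMPLE_SIZE bytes from the file and applies the heuristic. */
    static boolean looksLikeText(Path file) throws IOException {
        byte[] buffer = new byte[SAMPLE_SIZE];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) != -1) {
                total += n;
            }
        }
        return looksLikeText(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            boolean text = looksLikeText(Paths.get(arg));
            System.out.println(arg + ": " + (text ? "text" : "binary"));
        }
    }
}
```

Known formats (PDF, DOC, XLS and so on) would still need to be recognised first, by extension or by their own signatures; a check like this only makes sense as the fallback for files that nothing else claims.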




On Sat, 13 Mar 2004 Charlie Smith wrote :
>Anyone written a simple UNIX command-line indexing script which will read a
>bunch of different kinds of docs and index them?  I'd like to make a cron job
>out of this so as to be able to come back and read it later during a search.
>
>PERL or JAVA script would be fine.
>
>


Re: UNIX command-line indexing script?

2004-03-15 Thread Otis Gospodnetic
Erik and I are putting finishing touches on it, so by Summer (this one
;)).

Otis

--- Charlie Smith <[EMAIL PROTECTED]> wrote:
> So, how upcoming is this book going to be?



Re: UNIX command-line indexing script?

2004-03-15 Thread Charlie Smith
So, how upcoming is this book going to be?

>>> [EMAIL PROTECTED] 3/15/2004 3:39:39 AM >>>
To add to this.
The upcoming Lucene in Action book has ready-to-use code that will
handle and index files in most popular file formats.

Otis




Re: UNIX command-line indexing script?

2004-03-15 Thread Otis Gospodnetic
To add to this.
The upcoming Lucene in Action book has ready-to-use code that will
handle and index files in most popular file formats.

Otis

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Have a look at the Ant <index> task in the Lucene sandbox.  You're on
> your own, currently, to build this and understand it, but I use it
> frequently.  In fact, the sample index from our book is generated
> with this:
>
>   <index ... documenthandler="lia.common.TestDataDocumentHandler">
>     ...
>   </index>
>
> You can plug in your own DocumentHandler implementation to index
> different document types however you like.  The default one indexes
> .txt and .html files, but a custom implementation can do its own
> thing.  Again, writing a DocumentHandler that knows about various
> document types is not hard, but you will have to write your own at
> the moment.
>
> Despite the (minor) amount of work you'll have to do to start using
> <index> - the infrastructure adds a lot of value: an incremental file
> system indexer (only new docs get indexed on successive runs).
> Plugging this into cron would be trivial.
>
>   Erik
> 



Re: UNIX command-line indexing script?

2004-03-15 Thread Erik Hatcher
Have a look at the Ant <index> task in the Lucene sandbox.  You're on
your own, currently, to build this and understand it, but I use it
frequently.  In fact, the sample index from our book is generated with
this:

  <index ... documenthandler="lia.common.TestDataDocumentHandler">
    ...
  </index>

You can plug in your own DocumentHandler implementation to index
different document types however you like.  The default one indexes
.txt and .html files, but a custom implementation can do its own thing.
Again, writing a DocumentHandler that knows about various document
types is not hard, but you will have to write your own at the moment.

Despite the (minor) amount of work you'll have to do to start using
<index> - the infrastructure adds a lot of value: an incremental file
system indexer (only new docs get indexed on successive runs).
Plugging this into cron would be trivial.

	Erik
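
The incremental behaviour described above (only new docs get indexed on successive runs) can be approximated by comparing file modification times against what was recorded on the previous run. The sketch below is in Java and uses no Lucene classes. The thread does not say how the Ant task itself decides, so IncrementalScan, changedSince and snapshot are hypothetical names and the last-modified comparison is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of picking out files that are new or changed since the last run. */
final class IncrementalScan {

    /**
     * Returns, in sorted order, the paths in {@code current} that are absent
     * from {@code previous} or have a newer timestamp than the one recorded there.
     */
    static List<String> changedSince(Map<String, Long> previous, Map<String, Long> current) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : current.entrySet()) {
            Long seen = previous.get(entry.getKey());
            if (seen == null || seen < entry.getValue()) {
                changed.add(entry.getKey());
            }
        }
        Collections.sort(changed);
        return changed;
    }

    /** Records the last-modified time (in milliseconds) of every regular file under root. */
    static Map<String, Long> snapshot(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, Long> result = new HashMap<>();
        for (Path file : files) {
            result.put(file.toString(), Files.getLastModifiedTime(file).toMillis());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // An empty "previous" map stands in for a first run, so every file is reported.
        Map<String, Long> previous = new HashMap<>();
        for (String path : changedSince(previous, snapshot(root))) {
            System.out.println("would index: " + path);
        }
    }
}
```

A cron-driven run would load the previous snapshot, index only the returned paths, and then save the current snapshot for next time; persisting the snapshot and handling deleted files are left out here.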

On Mar 13, 2004, at 11:45 AM, Charlie Smith wrote:

> Anyone written a simple UNIX command-line indexing script which will
> read a bunch of different kinds of docs and index them?  I'd like to
> make a cron job out of this so as to be able to come back and read it
> later during a search.
>
> PERL or JAVA script would be fine.






UNIX command-line indexing script?

2004-03-13 Thread Charlie Smith
Anyone written a simple UNIX command-line indexing script which will read a
bunch of different kinds of docs and index them?  I'd like to make a cron job
out of this so as to be able to come back and read it later during a search.
 
PERL or JAVA script would be fine.