Re: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-02 Thread gudiseashok
Thank you very much for your time, sir. I will follow your suggestion.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Rendexing-problem-Indexing-folder-size-is-keep-on-growing-for-same-remote-folder-tp4092835p4093136.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-02 Thread Ian Lea
Yes, as I suggested, you could search on your unique id and not index
if already present.  Or, as Uwe suggested, call updateDocument instead
of add, again using the unique id.


--
Ian.


On Tue, Oct 1, 2013 at 6:41 PM, gudiseashok  wrote:
> I am really sorry if something I wrote confused you. As I said, I am
> indexing a folder which contains mylogs.log, mylogs1.log, mylogs2.log,
> etc.; I am not indexing them as flat files. I tokenize each line of text
> with a regex and store the pieces as fields such as "messageType",
> "timeStamp", and "message".
>
> So I don't care which of those four files contains a particular piece of
> content; I just want to insert only new records. My job updates these log
> files every 30 minutes, and each row is stored as a document. When I read
> the files again after 30 minutes, mylogs1.log will contain the previous
> contents of mylogs.log. So if a row with the same data already exists, I
> want to avoid writing that record again (from whichever of the four files
> it comes). Could you please suggest what I need to do when calling
> addDocument or updateDocument?
>
> Do I need to run a search before inserting each row, or is there a better
> way to avoid the duplicate writes?
>
> I really appreciate your time reading this, and thanks for responding.




Re: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-01 Thread gudiseashok
I am really sorry if something I wrote confused you. As I said, I am
indexing a folder which contains mylogs.log, mylogs1.log, mylogs2.log,
etc.; I am not indexing them as flat files. I tokenize each line of text
with a regex and store the pieces as fields such as "messageType",
"timeStamp", and "message".

So I don't care which of those four files contains a particular piece of
content; I just want to insert only new records. My job updates these log
files every 30 minutes, and each row is stored as a document. When I read
the files again after 30 minutes, mylogs1.log will contain the previous
contents of mylogs.log. So if a row with the same data already exists, I
want to avoid writing that record again (from whichever of the four files
it comes). Could you please suggest what I need to do when calling
addDocument or updateDocument?

Do I need to run a search before inserting each row, or is there a better
way to avoid the duplicate writes?

I really appreciate your time reading this, and thanks for responding.






Re: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-01 Thread Ian Lea
I'm still a bit confused about exactly what you're indexing, when, but
if you have a unique id and don't want to add or update a doc that's
already present, add the unique id to the index and search (TermQuery
probably) for each one and skip if already present.

Can't you change the log rotation/copying/indexing so that you only
index new data?

To start a fresh index, use IndexWriterConfig.OpenMode.CREATE.
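A minimal sketch of both points against the Lucene 4.4 API used in this thread. The field name "uniqueId", the id format, and the RAMDirectory are illustrative assumptions, not taken from the poster's code; opening a reader per row is also slow, so in real use the reader should be reused across a batch.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SkipIfPresent {

    // Add the document only if no document with this uniqueId is indexed yet.
    static void addIfAbsent(IndexWriter writer, Directory dir,
                            String uniqueId, Document doc) throws Exception {
        if (DirectoryReader.indexExists(dir)) {
            DirectoryReader reader = DirectoryReader.open(dir);
            try {
                IndexSearcher searcher = new IndexSearcher(reader);
                TermQuery q = new TermQuery(new Term("uniqueId", uniqueId));
                if (searcher.search(q, 1).totalHits > 0) {
                    return; // already indexed: skip, don't add a duplicate
                }
            } finally {
                reader.close();
            }
        }
        doc.add(new StringField("uniqueId", uniqueId, Field.Store.NO));
        writer.addDocument(doc);
        writer.commit(); // make the doc visible to the next existence check
    }

    static int demo() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
                new StandardAnalyzer(Version.LUCENE_44));
        // OpenMode.CREATE wipes any existing index: use it for the weekly fresh start.
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(dir, iwc);
        addIfAbsent(writer, dir, "14:00:00.123|row1", new Document());
        addIfAbsent(writer, dir, "14:00:00.123|row1", new Document()); // duplicate
        int n = writer.numDocs();
        writer.close();
        return n; // 1: the duplicate was skipped
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```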


--
Ian.


On Tue, Oct 1, 2013 at 4:51 PM, gudiseashok  wrote:
> Hi
>
> Basically my log folder consists of four log files, abc.log, abc1.log,
> abc2.log, abc3.log, written by my log appender. Every 30 minutes the
> content of all these files changes; for example, after a 30-minute
> refresh the content of abc1.log is replaced with the existing abc.log
> content, and abc.log gets new content (Timestamp is DD-MM- MM-ss:S).
> Since I re-index every 30 minutes, I don't want to re-index a record
> that is already present with the same timestamp.
>
> Also, I want to do a clean-up every week (by clean-up I mean deleting
> the whole index and doing a fresh indexing of these four files); how do
> I do this efficiently?
>
> I really appreciate your time reading this; kindly suggest a better way.
>
>
> Regards
> Ashok Gudise
>




Re: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-01 Thread gudiseashok
Hi 

Basically my log folder consists of four log files, abc.log, abc1.log,
abc2.log, abc3.log, written by my log appender. Every 30 minutes the
content of all these files changes; for example, after a 30-minute
refresh the content of abc1.log is replaced with the existing abc.log
content, and abc.log gets new content (Timestamp is DD-MM- MM-ss:S).
Since I re-index every 30 minutes, I don't want to re-index a record
that is already present with the same timestamp.

Also, I want to do a clean-up every week (by clean-up I mean deleting
the whole index and doing a fresh indexing of these four files); how do
I do this efficiently?

I really appreciate your time reading this; kindly suggest a better way.


Regards
Ashok Gudise






Re: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-01 Thread Ian Lea
Milliseconds as unique keys are a bad idea unless you are 100% certain
you'll never create two docs in the same millisecond.  And are you
saying that log record A1 from file a.log indexed at 14:00 will have
the same unique id as the same record from the same file indexed at
14:30, or will it be different?

If the same, you can use updateDocument as Uwe suggested.

If different, and you want to replace all the docs already indexed
from file a.log with the current contents of a.log, I suggest you
store the file name as an indexed field for each record from each file
and, when you reindex a file, start by calling
IndexWriter.deleteDocuments(Term t), where t is a Term that references
the file name.
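A sketch of that delete-then-reindex pattern against the Lucene 4.4 API. The field names "fileName" and "logMessage", and the demo around RAMDirectory, are illustrative assumptions rather than the poster's actual code.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ReindexOneFile {

    // Replace everything previously indexed from this file with its current rows.
    static void reindexFile(IndexWriter writer, String fileName,
                            Iterable<String> rows) throws Exception {
        // 1. Drop every doc tagged with this file name.
        writer.deleteDocuments(new Term("fileName", fileName));
        // 2. Re-add the file's current contents, tagging each row.
        for (String row : rows) {
            Document doc = new Document();
            // StringField is indexed as a single token, so the Term above matches exactly.
            doc.add(new StringField("fileName", fileName, Field.Store.NO));
            doc.add(new TextField("logMessage", row, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.commit();
    }

    static int demo() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44)));
        List<String> rows = Arrays.asList("row 1", "row 2");
        reindexFile(writer, "a.log", rows); // initial index: 2 docs
        reindexFile(writer, "a.log", rows); // re-index: still 2 docs, no growth
        writer.close();
        DirectoryReader reader = DirectoryReader.open(dir);
        int n = reader.numDocs();
        reader.close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // 2
    }
}
```

Because the rotated copies (abc1.log etc.) are just old versions of abc.log, deleting by file name before each re-index keeps the index size proportional to the current file contents.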

--
Ian.


On Tue, Oct 1, 2013 at 2:20 PM, gudiseashok  wrote:
> I'm afraid my document in the above code already has a unique key (built
> with milliseconds; I hope this is enough to differentiate one record
> from another).
>
> My requirement is simple: I have a folder with a.log, b.log, and c.log
> files which are updated every 30 minutes, and I want to keep the index
> of these files up to date by re-indexing them. I am trying to explore
> Lucene indexing, but I have not been able to find much help beyond the
> demo Java files.
>
> Kindly suggest.
>
>
> Regards
> Ashok Gudise.
>
>
>




RE: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-01 Thread gudiseashok
I'm afraid my document in the above code already has a unique key (built
with milliseconds; I hope this is enough to differentiate one record
from another).

My requirement is simple: I have a folder with a.log, b.log, and c.log
files which are updated every 30 minutes, and I want to keep the index
of these files up to date by re-indexing them. I am trying to explore
Lucene indexing, but I have not been able to find much help beyond the
demo Java files.

Kindly suggest.


Regards
Ashok Gudise.






RE: Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-10-01 Thread Uwe Schindler
You have to call updateDocument with the unique key of the document to update. 
The unique key must be a separate, indexed, not necessarily stored key. 
addDocument just adds a new instance of the document to the index, it cannot 
determine if it’s a duplicate.
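A minimal upsert sketch of what Uwe describes, against the Lucene 4.4 IndexWriter.updateDocument(Term, ...) API. The field name "uniqueId" and the demo values are illustrative assumptions, not from the poster's code.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Upsert {

    // updateDocument = atomically "delete any doc matching the term, then add this doc".
    static void upsert(IndexWriter writer, String uniqueId, Document doc)
            throws Exception {
        // The key field must be indexed (StringField = single untokenized term);
        // it need not be stored.
        doc.add(new StringField("uniqueId", uniqueId, Field.Store.NO));
        writer.updateDocument(new Term("uniqueId", uniqueId), doc);
    }

    static int demo() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44)));
        Document d1 = new Document();
        d1.add(new TextField("logMessage", "first version", Field.Store.YES));
        upsert(writer, "row-42", d1);
        Document d2 = new Document();
        d2.add(new TextField("logMessage", "second version", Field.Store.YES));
        upsert(writer, "row-42", d2); // replaces d1 instead of duplicating it
        writer.close();
        DirectoryReader reader = DirectoryReader.open(dir);
        int n = reader.numDocs();
        reader.close();
        return n; // 1: same key, so the index did not grow
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

If the same row can arrive from several files (abc.log vs. its rotated copy abc1.log), derive the unique id from the row's own content (e.g. timestamp plus message), not from the file name, so the rotated copy maps to the same key.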

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: gudiseashok [mailto:gudise.as...@gmail.com]
> Sent: Tuesday, October 01, 2013 2:01 AM
> To: java-user@lucene.apache.org
> Subject: Rendexing problem: Indexing folder size is keep on growing for
> same remote folder
> 
> Hi
> 
> I am reading log files from a remote folder, copying them to a local
> folder, and then indexing them as shown below. However, I am not saving
> the whole content as an input stream; I am splitting each line with a
> Grok regex and saving the resulting strings as separate fields.
>
> I repeat this "copy and index" process every 30 minutes (as a cron
> job), and my index folder size doubles on completion of each batch. I
> am using Lucene 4.4; after looking at the addDocument implementation in
> IndexWriter, I assumed it would behave like an update because I open
> the writer with (CREATE_OR_APPEND).
>
> Kindly suggest the right approach if I am not doing this correctly; my
> requirement is to pick up the log folder content every 30 minutes and
> re-index it.
>
> Thanks for taking the time to read this; please see my configuration
> snippet below...
>
>
> //Adding document
> document.add(new StringField("className", logsVO.getClassName(),
> Field.Store.YES));
> document.add(new StringField("logLevel", logsVO.getLogLevel(),
> Field.Store.NO));
> document.add(new TextField("logMessage", logsVO.getLogMessage(),
> Field.Store.YES));
> document.add(new StringField("messageType",
> logsVO.getMessageType().toString(), Field.Store.NO));
> document.add(new LongField("timeStamp", logsVO.getTimeStamp().getTime(),
> Field.Store.YES));
> IndexWriter writer = luceneUtil.getIndexWriter();
> writer.addDocument(document);
>
> //addDocument delegates to this IndexWriter (API class) method:
> public void addDocument(Iterable<? extends IndexableField> doc) throws
> IOException {
>   addDocument(doc, analyzer);
> }
>
> //Writer creation approach...
>
> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
> analyzer);
> iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
> this.writer = new IndexWriter(dir, iwc);
> ///
> 
> 
> 
> 
> 





Rendexing problem: Indexing folder size is keep on growing for same remote folder

2013-09-30 Thread gudiseashok
Hi 

I am reading log files from a remote folder, copying them to a local
folder, and then indexing them as shown below. However, I am not saving
the whole content as an input stream; I am splitting each line with a
Grok regex and saving the resulting strings as separate fields.

I repeat this "copy and index" process every 30 minutes (as a cron job),
and my index folder size doubles on completion of each batch. I am using
Lucene 4.4; after looking at the addDocument implementation in
IndexWriter, I assumed it would behave like an update because I open the
writer with (CREATE_OR_APPEND).

Kindly suggest the right approach if I am not doing this correctly; my
requirement is to pick up the log folder content every 30 minutes and
re-index it.

Thanks for taking the time to read this; please see my configuration
snippet below...


//Adding document
 document.add(new StringField("className", logsVO.getClassName(),
Field.Store.YES));
 document.add(new StringField("logLevel", logsVO.getLogLevel(),
Field.Store.NO));
 document.add(new TextField("logMessage", logsVO.getLogMessage(),
Field.Store.YES));
 document.add(new StringField("messageType",
logsVO.getMessageType().toString(), Field.Store.NO));
 document.add(new LongField("timeStamp", logsVO.getTimeStamp().getTime(),
Field.Store.YES));
 IndexWriter writer =  luceneUtil.getIndexWriter();
writer.addDocument(document);

//addDocument delegates to this IndexWriter (API class) method:
public void addDocument(Iterable<? extends IndexableField> doc) throws
IOException {
  addDocument(doc, analyzer);
}

//Writer creation approach...

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
analyzer);
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
this.writer = new IndexWriter(dir, iwc);
///




