Profiling lucene 5.2.0 based tool

2016-02-22 Thread sandeep das
Hi,

I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool
reads data from CSV files (residing on disk) and creates indexes on local
disk. It is able to process 3.5 MBps of data. There are overall 46 fields
being added to each document, of only three data types: 1. Integer, 2. Long,
3. String.
All these fields are part of one CSV record, and they are parsed using a
custom CSV parser that is faster than any String.split method.

I've configured the following parameters when creating the IndexWriter:
1. setOpenMode(OpenMode.CREATE)
2. setCommitOnClose(true)
3. setRAMBufferSizeMB(512)   // Tried 256 and 312 as well, but performance is
almost the same.
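For reference, the three settings above combine into a writer setup roughly like this (a minimal sketch; the analyzer choice and the directory path are illustrative assumptions, not from the original post):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;

public class IndexerSetup {
    public static IndexWriter openWriter(String indexDir) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(OpenMode.CREATE);   // always create a fresh index
        config.setCommitOnClose(true);         // commit pending docs on close()
        config.setRAMBufferSizeMB(512);        // flush a segment when buffer fills
        return new IndexWriter(FSDirectory.open(Paths.get(indexDir)), config);
    }
}
```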

I've read in several blogs that Lucene works much faster than these figures,
so I suspected there were bottlenecks in my code and profiled it using
jvisualvm. The application spends most of its time in
DefaultIndexChain.processField, i.e. 53% of total time.


Following is the split of CPU usage in this application:
1. Reading data from disk takes 5% of the total duration.
2. Adding documents takes 93% of the total duration.

   - postUpdate  -> 12.8%
   - doAfterDocument -> 20.6%
   - updateDocument  -> 59.8%
     - finishDocument -> 1.7%
     - finishStoredFields -> 4.8%
     - processField -> 53.1%


I'm also attaching a screenshot of the call graph generated by jvisualvm.

I've taken care of the following points:
1. Create only one instance of IndexWriter.
2. Create only one instance of Document and reuse it throughout the lifetime
of the application.
3. There will be no updates to documents, hence only addDocument is invoked.
Note: After going through the code I found that addDocument internally calls
updateDocument. Is there any way to avoid calling updateDocument and use only
the addDocument API?
4. Use the setValue APIs to set pre-created fields, and reuse these fields
when creating indexes.
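Point 4 above, reusing Field instances across documents, might look like the following sketch (field names and types are illustrative; the Lucene 5.x API is assumed):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;

// Create the Document and its Field instances once, up front.
Document doc = new Document();
IntField count = new IntField("count", 0, Store.YES);
StringField name = new StringField("name", "", Store.YES);
doc.add(count);
doc.add(name);

// Per CSV record: mutate the existing fields instead of allocating new ones,
// then re-add the same Document.
count.setIntValue(42);
name.setStringValue("example");
// writer.addDocument(doc);
```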

Any tip to improve performance would be immensely appreciated.

Regards,
Sandeep

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Profiling lucene 5.2.0 based tool

2016-02-22 Thread sandeep das
Hi Rob,

The statistics I shared were measured with one indexing thread. I wish to use
only one thread and want to process a data rate of up to 10 MBps (megabytes
per second). I believe this should be achievable with a single thread.

Regards,
Sandeep

On Tue, Feb 23, 2016 at 12:50 PM, Rob Audenaerde 
wrote:

> Hi Sandeep,
>
> How many threads do you use to do the indexing? The benchmarks of Lucene
> are done on >20 threads IIRC.
>
> -Rob
>
> On Tue, Feb 23, 2016 at 8:01 AM, sandeep das  wrote:


Re: Profiling lucene 5.2.0 based tool

2016-02-23 Thread sandeep das
Thanks a lot, guys. I really appreciate your responses to my query. I'll
create multiple threads and check how much the rate can be increased per
thread.


Regards,
Sandeep

On Tue, Feb 23, 2016 at 4:19 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Your profiler breakdown is exactly what I'd expect: processing the
> fields is the heaviest part of indexing.
>
> Except, it doesn't have any merges?  Did you run it for long enough?
> Note that by default Lucene runs merges in a background thread
> (ConcurrentMergeScheduler).  If you really must stay single-threaded
> (why?), then you should use SerialMergeScheduler instead.
>
> The doAfterDocument is likely the flush time (writing the new segment
> once the in-heap indexing buffer is full).
>
> Finally, if many of your fields are numeric, 6.0 offers some nice
> improvements here with the new dimensional points feature.  See
> https://www.elastic.co/blog/lucene-points-6.0 ... but note that 6.0 is not
> yet released, though it should be soon now.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
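Mike's SerialMergeScheduler suggestion amounts to a one-line change in the writer configuration (a minimal sketch; the analyzer choice is an illustrative assumption):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

// By default Lucene uses ConcurrentMergeScheduler, which runs merges in
// background threads. If the indexing process must stay single-threaded,
// SerialMergeScheduler runs merges inline on the indexing thread instead.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setMergeScheduler(new SerialMergeScheduler());
```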
>
> On Tue, Feb 23, 2016 at 2:01 AM, sandeep das  wrote:


Compression technique for stored fields

2016-02-23 Thread sandeep das
Hi guys,

While running my application I noticed that LZ4 is used as the compression
technique for stored fields. Is there any option by which I can change it to
Snappy?

Regards,
Sandeep
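Snappy is not among Lucene's built-in stored-fields codecs; out of the box, 5.x offers LZ4 (BEST_SPEED, the default) and DEFLATE (BEST_COMPRESSION). Switching between the two might look like the sketch below (Lucene 5.x API assumed); using Snappy would require implementing a custom StoredFieldsFormat/Codec:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50Codec;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.Mode;
import org.apache.lucene.index.IndexWriterConfig;

// Trade some indexing speed for smaller stored fields by selecting
// DEFLATE (BEST_COMPRESSION) instead of the default LZ4 (BEST_SPEED).
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setCodec(new Lucene50Codec(Mode.BEST_COMPRESSION));
```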


Lucene-5.2.0 on HDFS

2016-02-28 Thread sandeep das
Hi All,

I was trying to create indexes on an HdfsDirectory, so I tried to use
lucene-hdfs-directory-4.7.0, but it seems to be incompatible with
lucene-5.2.0. The HdfsDirectory class requires an instance of
BufferedIndexOutput, which has been removed in lucene-5.2.0, hence my
application repeatedly fails with a NoClassDefFoundError. Following is the
stack trace:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/lucene/store/BufferedIndexOutput
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at com.poc.lucene.core.Indexer2.startIndexing(Indexer2.java:102)
at com.poc.lucene.core.Indexer2.indexFiles(Indexer2.java:72)
at com.poc.lucene.client.IndexFieldClient.main(IndexFieldClient.java:17)



Can someone guide me in sorting out this problem, or help me understand the
correct way to create indexes on HDFS using Lucene? Is Solr the only way to
create indexes on HDFS with Lucene?

Thanks in advance.


Regards,
Sandeep


Creating composite query in lucene

2016-03-08 Thread sandeep das
Hi,

I'm using lucene-5.2.0, and in the query interface I wish to compose a query
like

"a=x and (b=y or d=z)"

which can be described as: if a document has value "x" for field "a", and
field "b" has value "y" or field "d" has value "z", then that document
should be chosen. There are three fields in my document, i.e. a, b and c.

I was thinking of using a BooleanQuery object to implement such a query, but
it seems difficult to express the nesting.

If I write the above clause in terms of a boolean query, this is the best I
can think of:

BooleanQuery query = new BooleanQuery();
query.add(new RegexpQuery(new Term("a", "x")), BooleanClause.Occur.MUST);
query.add(new RegexpQuery(new Term("b", "y")), BooleanClause.Occur.SHOULD);
query.add(new RegexpQuery(new Term("c", "z")), BooleanClause.Occur.SHOULD);


But in the above case a document will be selected even if it matches neither
b=y nor c=z, as long as field "a" has value "x": when a BooleanQuery contains
a MUST clause, its SHOULD clauses become optional and only influence scoring,
so the OR condition is effectively ignored.

Correct me if my understanding is wrong here.

Can someone please suggest some better solution to compose such query?

Regards,
Sandeep


Re: Creating composite query in lucene

2016-03-09 Thread sandeep das
Hi Jack,

Thanks a lot for your suggestion.

Regards,
Sandeep

On Tue, Mar 8, 2016 at 8:32 PM, Jack Krupansky 
wrote:

> BooleanQuery can be nested, so you do a top-level BQ with two MUST clauses:
> the first a TermQuery for a:x, and the second another BQ that itself has
> two clauses, both SHOULD.
>
> -- Jack Krupansky
>
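Jack's nesting suggestion could be sketched as follows (a minimal sketch against the 5.2 API, where BooleanQuery is still built by calling add directly; the RegexpQuery from the question is swapped for TermQuery here under the assumption that exact matches are intended):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Inner query: (b=y OR d=z). With no MUST clause inside this sub-query,
// at least one of its SHOULD clauses must match.
BooleanQuery inner = new BooleanQuery();
inner.add(new TermQuery(new Term("b", "y")), Occur.SHOULD);
inner.add(new TermQuery(new Term("d", "z")), Occur.SHOULD);

// Outer query: a=x AND (b=y OR d=z). Both clauses are required.
BooleanQuery outer = new BooleanQuery();
outer.add(new TermQuery(new Term("a", "x")), Occur.MUST);
outer.add(inner, Occur.MUST);
```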
> On Tue, Mar 8, 2016 at 4:38 AM, sandeep das  wrote: