"batch-update"-pattern, NoMergeScheduler?

2014-12-22 Thread Clemens Wyss DEV
One of our indexes is rebuilt completely quite frequently (a "batch update" or
"re-index"). When that happens, more than 2 million documents are added to or
updated in that index, which creates an immense I/O load on our system. Does it
make sense to set the merge scheduler to NoMergeScheduler (and/or the
MergePolicy to NoMergePolicy)? Or is merging "not relevant" because the commit
is done only at the very end?

Context information:
At the moment the writer's config consists only of setRAMBufferSizeMB:
IndexWriterConfig config = new IndexWriterConfig(
    IndexManager.CURRENT_LUCENE_VERSION, analyzer );
config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES );
//config.setMergeScheduler( NoMergeScheduler.INSTANCE );
config.setRAMBufferSizeMB( 20 );

The update logic is as follows:
indexWriter.deleteAll();
...
for all elements do {
    ...
    indexWriter.updateDocument( term, doc ); // to avoid "duplicate entries"
    ...
}
indexWriter.commit();

What is the proposed way to perform such a batch update?


RE: BTRFS ?

2014-12-22 Thread Uwe Schindler
Hi Dawid,

there are cool things that might be useful, just not for Lucene's Java code
itself. Like ZFS, it now has snapshot functionality, and you can copy files
mostly without doing I/O (a shallow copy; it uses copy-on-write semantics for
that). This might be useful for backup purposes. I know we already did most of
this in the index file format, but it might be a good idea to investigate it,
e.g. for the snapshot/backup functionality of Elasticsearch. You can also
quickly put an index on a separate subvolume, so you can manage it separately
from your other filesystems. Those subvolumes can be extended without hassle if
they get too small (you can also overallocate the real available space). You
can also quickly turn subvolumes into RAIDs.

Also interesting is its slightly different approach to journaling. This might
improve performance for us, but I have not tested it yet.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Dawid Weiss [mailto:dawid.we...@gmail.com]
> Sent: Monday, December 22, 2014 8:48 AM
> To: java-user@lucene.apache.org
> Cc: Uwe Schindler
> Subject: Re: BTRFS ?
> 
> > I spotted Uwe's comment in JIRA the other day "BTRFS, which might
> > also bring some cool things for Lucene.".
> 
> What cool things about BTRFS are you talking about, Uwe? Just curious.
> 
> Dawid
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BTRFS ?

2014-12-22 Thread Dawid Weiss
Very interesting, have to take a closer look. Thanks Uwe.

D.




RE: BTRFS ?

2014-12-22 Thread Uwe Schindler
Hi,

In fact, the shallow-copy possibility (available as "cp --reflink=always") in
btrfs and other file systems that support it is really interesting. It would be
cool if Java 7+'s Files.copy(Path, Path, CopyOption...) could use this through
an additional CopyOption, maybe in Java 9. The trick is to clone the file and
its inode but keep the blocks the same (a block is cloned only when someone
writes to the file). This could speed up tests, especially in Solr, where some
dirs are copied over and over for every test case. :-)
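
For reference, this is roughly what copying a directory tree looks like with
the JDK as it stands. It is only a sketch: the class TreeCopy and its method
are made up for illustration, and the code asks for an ordinary copy.
StandardCopyOption has no reflink-style option, so whether any blocks end up
shared is left to the JDK and the file system underneath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: copies a directory tree with plain Files.copy. */
final class TreeCopy {

    /** Copies everything below source into target; returns the number of regular files copied. */
    static int copyTree(Path source, Path target) {
        try (Stream<Path> walk = Files.walk(source)) {
            // Files.walk visits a directory before its entries, so parents are created first.
            List<Path> entries = walk.collect(Collectors.toList());
            int files = 0;
            for (Path entry : entries) {
                Path destination = target.resolve(source.relativize(entry).toString());
                if (Files.isDirectory(entry)) {
                    Files.createDirectories(destination);
                } else {
                    // An ordinary copy; no copy-on-write cloning is requested here.
                    Files.copy(entry, destination, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            Path source = Files.createTempDirectory("index-src");
            Files.write(source.resolve("segments_1"), new byte[] {1, 2, 3});
            Path target = Files.createTempDirectory("index-copy");
            System.out.println(copyTree(source, target) + " file(s) copied to " + target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reflink-aware CopyOption, if one existed, would be passed in the same
Files.copy call; the rest of the loop would stay the same.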

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de





Re: Lucene Spatial Implementation for Points within Polygon.

2014-12-22 Thread david.w.smi...@gmail.com
Hello.

You have stated the use-case so generically that it’s not clear whether you
should index the polygon set and query by the point set, or the reverse.
Generally, you should index the set that is known in advance and then query
by the other set, the one that is generally not known.  Assuming this is the
case, index the stable set with RecursivePrefixTreeStrategy, *and*, for
accuracy, if that set is also the polygon set, use SerializedDVStrategy
*or* simply keep all the polygons in memory, keyed by an identifier (call
JtsGeometry.index() on each as well), and check against them at runtime.  If
you don’t have enough RAM then you’ll do the former.  If neither set seems
to be “stable”, you could really index either, but I would definitely choose
to index the points.  The predicate you should use is INTERSECTS; the others
are intended for polygons against polygons (basically any non-point shape
against another non-point shape).

If your scenario is quite simple, in that you get a bunch of points and
polygons all at once, make this computation, and then that’s it (no
long-term need to query again by the same polygons or points in the
future), I suggest using JTS directly in memory, with its PreparedGeometry
to optimize each polygon, and then iterating through your points to see which
polygons they are in.  You might even use JTS's STRtree to index the polygon
bounding boxes to avoid looping over all polygons.
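
To make the in-memory variant concrete, here is a minimal sketch that uses
only the JDK. Note the substitutions: java.awt.geom.Path2D stands in for JTS's
PreparedGeometry, and a precomputed bounding-box check stands in for STRtree
(it still loops over every polygon, it only makes the rejection cheap).
Coordinates are treated as planar, and the class and method names are made up
for this example.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.Arrays;

/** Hypothetical example: counts how many points fall inside each polygon. */
final class PointsInPolygons {

    /** Builds a closed polygon from a ring of {x, y} vertices. */
    static Path2D polygon(double[][] ring) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(ring[0][0], ring[0][1]);
        for (int i = 1; i < ring.length; i++) {
            path.lineTo(ring[i][0], ring[i][1]);
        }
        path.closePath();
        return path;
    }

    /** Returns, for each polygon, the number of points that lie inside it. */
    static int[] countPointsPerPolygon(Path2D[] polygons, double[][] points) {
        // Compute every bounding box once, up front.
        Rectangle2D[] boxes = new Rectangle2D[polygons.length];
        for (int i = 0; i < polygons.length; i++) {
            boxes[i] = polygons[i].getBounds2D();
        }
        int[] counts = new int[polygons.length];
        for (double[] p : points) {
            for (int i = 0; i < polygons.length; i++) {
                // Cheap box rejection first; the exact test runs only on a box hit.
                if (boxes[i].contains(p[0], p[1]) && polygons[i].contains(p[0], p[1])) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Path2D[] polygons = {
            polygon(new double[][] {{0, 0}, {10, 0}, {10, 10}, {0, 10}}),
            polygon(new double[][] {{20, 0}, {30, 0}, {25, 10}})
        };
        double[][] points = {{5, 5}, {25, 2}, {50, 50}, {1, 1}};
        System.out.println(Arrays.toString(countPointsPerPolygon(polygons, points)));
    }
}
```

Points that sit exactly on a polygon edge follow Path2D's own insideness
rules, which need not match what JTS or INTERSECTS would report.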

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Dec 22, 2014 at 12:30 AM,  wrote:
>
> Hello Team,
>
> We are starting off with Lucene Spatial implementation for some of the use
> cases:
>
> A . Given "N" polygons and "M" points, find how many points lie inside
> each of the polygon.
>
> 1st Approach :
>
> For A, we indexed the polygons as WKT using the JtsSpatial strategy. I set
> the level to 22, which resulted in a huge number of terms. This was
> needed because the search has to be near perfect.
>
> For querying, I used a point (supplied as WKT), again via JTS, with the level
> at 22 (although I think specifying the level at query time does not make much
> difference).
>
> For this, we used "CONTAINS". We do get output, but I am not sure whether I
> am doing it the right way. I need suggestions.
>
> I am having following confusion:
>
> a.   Will CONTAINS and IS WITHIN both work in the same way for the
> given scenario? I am ruling out INTERSECTS as that scenario is not
> appropriate.
>
> b.   Second, are we missing something in getting the correct output?
>
>
> 2nd Approach : (Reversed)
>
> Indexed POINTS in WKT format.
> Passed the polygons as WKT, using JTS, as the query and ran it with
> INTERSECTS and WITHIN.
>
> In second approach, we are getting more output than the 1st approach.
>
> However, we are still not sure which is the best way to tackle this
> problem. Please suggest.
>
> "Confidentiality Warning: This message and any attachments are intended
> only for the use of the intended recipient(s).
> are confidential and may be privileged. If you are not the intended
> recipient. you are hereby notified that any
> review. re-transmission. conversion to hard copy. copying. circulation or
> other use of this message and any attachments is
> strictly prohibited. If you are not the intended recipient. please notify
> the sender immediately by return email.
> and delete this message and any attachments from your system.
>
> Virus Warning: Although the company has taken reasonable precautions to
> ensure no viruses are present in this email.
> The company cannot accept responsibility for any loss or damage arising
> from the use of this email or attachment."
>


Re: Distance between 2 points Lucene Spatial

2014-12-22 Thread david.w.smi...@gmail.com
Hi Ankit,

Vincenty is the most accurate one; it is the benchmark that supplies the true
answer in the tests of the other two.  In theory it produces the same answers
as the other two, simpler formulas you mention, but it is “numerically robust”
for computers.  Note that the world model used by Spatial4j when in “geo” mode
is a spherical model.  For more accurate distance computation on Earth, use
an ellipsoidal model.  If you google “Vincenty”, it's easy to find
Vincenty’s ellipsoidal formula with the constants for Earth; that is most
often what he is associated with.
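
To illustrate the point about the three formulas agreeing on a sphere, here
is a small self-contained sketch of their spherical forms. The class and
method names are made up for this example and are not Spatial4j's API; inputs
and results are in radians, and the ellipsoidal Vincenty formula is not shown.

```java
/** Hypothetical example: three great-circle formulas on a sphere, all in radians. */
final class SphericalDistances {

    /** Haversine formula; returns the central angle in radians. */
    static double haversine(double lat1, double lon1, double lat2, double lon2) {
        double sinLat = Math.sin((lat2 - lat1) / 2);
        double sinLon = Math.sin((lon2 - lon1) / 2);
        double a = sinLat * sinLat + Math.cos(lat1) * Math.cos(lat2) * sinLon * sinLon;
        // Clamp to 1 so rounding near antipodal points cannot push asin out of range.
        return 2 * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Spherical law of cosines; loses precision for very small distances. */
    static double lawOfCosines(double lat1, double lon1, double lat2, double lon2) {
        double c = Math.sin(lat1) * Math.sin(lat2)
                + Math.cos(lat1) * Math.cos(lat2) * Math.cos(lon2 - lon1);
        return Math.acos(Math.max(-1.0, Math.min(1.0, c)));
    }

    /** Vincenty's formula specialised to a sphere (atan2 form). */
    static double vincentySphere(double lat1, double lon1, double lat2, double lon2) {
        double dLon = lon2 - lon1;
        double y1 = Math.cos(lat2) * Math.sin(dLon);
        double y2 = Math.cos(lat1) * Math.sin(lat2)
                - Math.sin(lat1) * Math.cos(lat2) * Math.cos(dLon);
        double x = Math.sin(lat1) * Math.sin(lat2)
                + Math.cos(lat1) * Math.cos(lat2) * Math.cos(dLon);
        return Math.atan2(Math.sqrt(y1 * y1 + y2 * y2), x);
    }

    public static void main(String[] args) {
        double lat1 = Math.toRadians(48.85), lon1 = Math.toRadians(2.35);
        double lat2 = Math.toRadians(40.71), lon2 = Math.toRadians(-74.01);
        System.out.println("haversine      = " + Math.toDegrees(haversine(lat1, lon1, lat2, lon2)));
        System.out.println("law of cosines = " + Math.toDegrees(lawOfCosines(lat1, lon1, lat2, lon2)));
        System.out.println("vincenty       = " + Math.toDegrees(vincentySphere(lat1, lon1, lat2, lon2)));
    }
}
```

For a pair of points this far apart the three results should agree to many
decimal places; the differences only show up at very small or near-antipodal
separations.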

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Dec 22, 2014 at 12:35 AM,  wrote:
>
> Dear All,
>
> We are using lucene spatial strategy to find out the distance between a
> pair of Lat/Long.
>
> Given a pair of Lat/Long I need to find a nearly accurate distance between
> these 2 points.
>
> I have used Haversine, LawOfCosines and Vincenty, but I am unable to
> decide which will provide the best (most accurate) output.
>
> There is not just 1 point but millions of points, each of which will need to
> be compared against a set of points to find the closest point.
>
> Which might be the best approach? Additionally, I observed from the API
> that the output of these 3 algorithms is in degrees. Is there any API in
> Lucene that can return the output in double, long, int etc. formats?
>


RE: Distance between 2 points Lucene Spatial

2014-12-22 Thread Ankit.Murarka
Thanks for the suggestion.

I am using Lucene's Vincenty to find the distance, but the output is strange. I
cannot figure out how to convert the output to metres/kilometres.

After an extensive search on Google, I found the GeoDesy source code, which
gives me the distance in metres. It is also an implementation of Vincenty.

However, I do not intend to use GeoDesy.

I would prefer to use Lucene's built-in Vincenty to get the distance in metres,
but I am unable to find a way to do this.

Please suggest.




Re: Distance between 2 points Lucene Spatial

2014-12-22 Thread david.w.smi...@gmail.com
I forgot this part of your question.

To go from degrees to KM, multiply by DistanceUtils.DEG_TO_KM.
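
As a rough illustration of what that multiplication does, the sketch below
derives a kilometres-per-degree factor by hand. The radius value is an
assumption (a commonly quoted mean Earth radius), so the library's own
constant may differ in the last digits, and the class and method names are
made up for this example.

```java
/** Hypothetical example: converts a great-circle distance in degrees to km and metres. */
final class DegreesToDistance {

    /** Assumed mean Earth radius in kilometres; the library constant may differ slightly. */
    static final double EARTH_MEAN_RADIUS_KM = 6371.0088;

    /** Kilometres covered by one degree of arc on a sphere of that radius. */
    static final double DEG_TO_KM = Math.toRadians(1.0) * EARTH_MEAN_RADIUS_KM;

    static double degreesToKm(double degrees) {
        return degrees * DEG_TO_KM;
    }

    static double degreesToMetres(double degrees) {
        return degreesToKm(degrees) * 1000.0;
    }

    public static void main(String[] args) {
        double distanceDegrees = 0.5; // e.g. a distance reported in degrees
        System.out.println(degreesToKm(distanceDegrees) + " km");
        System.out.println(degreesToMetres(distanceDegrees) + " m");
    }
}
```

Because the model is a sphere, the result is a great-circle distance and not
an ellipsoidal one.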

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
