[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2016-05-30 Thread Lvton. Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307202#comment-15307202
 ] 

Lvton. Smith commented on LUCENE-6825:
--

bq. It should give sizable performance gains (*smaller index, faster 
searching*) over what we have today, and even over what auto-prefix with 
efficient numeric terms would do.

*smaller index* 
   yes!!

*faster searching??* 
I cannot understand that :(
"Auto-prefix" uses less terms, although larger index. But, "bkd", I think, 
is equal to using "every number term between the range".  Why faster?

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 6.0
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2016-05-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306402#comment-15306402
 ] 

Michael McCandless commented on LUCENE-6825:


I think a B+ tree would be overkill in the 1D case?  Also, Lucene segments are 
write-once, so the ability/freedom to "insert" in the tree (and "delete") would 
be unused.

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 6.0
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2016-05-30 Thread Lvton. Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306351#comment-15306351
 ] 

Lvton. Smith commented on LUCENE-6825:
--

For 1D case, would B+ tree perform as well as bkd-tree? I think so. What about 
you, Michael?

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 6.0
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2016-05-30 Thread Lvton. Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306340#comment-15306340
 ] 

Lvton. Smith commented on LUCENE-6825:
--

For 1D case, would B+ tree perform as well as bkd-tree? I think so. What about 
you, Michael?

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 6.0
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-27 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976085#comment-14976085
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1710752 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1710752 ]

LUCENE-6825: add dimensionally indexed values

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-27 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976476#comment-14976476
 ] 

Steve Rowe commented on LUCENE-6825:


My Jenkins found a seed that reproduces for me: {{TestDimensionalValues}} tests 
({{testMultiValued()}} 100% and {{testMerge()}} sometimes) trigger an NPE in 
{{DimensionalWriter.merge()}} in the {{DimensionalReader.intersect()}} 
implementation there:

{noformat}
   [junit4] Suite: org.apache.lucene.index.TestDimensionalValues
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 
'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestDimensionalValues -Dtests.method=testMultiValued 
-Dtests.seed=367B5FB4E6C5CEFF -Dtests.slow=true 
-Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
-Dtests.locale=ga_IE -Dtests.timezone=Mexico/BajaSur -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   0.57s J0 | TestDimensionalValues.testMultiValued <<<
   [junit4]> Throwable #1: org.apache.lucene.store.AlreadyClosedException: 
this IndexWriter is closed
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([367B5FB4E6C5CEFF:E25B3B8628078EB7]:0)
   [junit4]>at 
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:713)
   [junit4]>at 
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:727)
   [junit4]>at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1457)
   [junit4]>at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1240)
   [junit4]>at 
org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:173)
   [junit4]>at 
org.apache.lucene.index.TestDimensionalValues.verify(TestDimensionalValues.java:829)
   [junit4]>at 
org.apache.lucene.index.TestDimensionalValues.verify(TestDimensionalValues.java:791)
   [junit4]>at 
org.apache.lucene.index.TestDimensionalValues.testMultiValued(TestDimensionalValues.java:212)
   [junit4]>at java.lang.Thread.run(Thread.java:745)
   [junit4]> Caused by: java.lang.NullPointerException
   [junit4]>at 
org.apache.lucene.codecs.DimensionalWriter$1.intersect(DimensionalWriter.java:56)
   [junit4]>at 
org.apache.lucene.codecs.simpletext.SimpleTextDimensionalWriter.writeField(SimpleTextDimensionalWriter.java:139)
   [junit4]>at 
org.apache.lucene.codecs.DimensionalWriter.merge(DimensionalWriter.java:45)
   [junit4]>at 
org.apache.lucene.index.SegmentMerger.mergeDimensionalValues(SegmentMerger.java:168)
   [junit4]>at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:117)
   [junit4]>at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4055)
   [junit4]>at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3635)
   [junit4]>at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
   [junit4]>at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
   [junit4]   2> DFómh 27, 2015 6:47:10 A.M. 
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
 uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
Thread #1,5,TGRP-TestDimensionalValues]
   [junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.NullPointerException
   [junit4]   2>at 
__randomizedtesting.SeedInfo.seed([367B5FB4E6C5CEFF]:0)
   [junit4]   2>at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
   [junit4]   2>at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
   [junit4]   2> Caused by: java.lang.NullPointerException
   [junit4]   2>at 
org.apache.lucene.codecs.DimensionalWriter$1.intersect(DimensionalWriter.java:56)
   [junit4]   2>at 
org.apache.lucene.codecs.simpletext.SimpleTextDimensionalWriter.writeField(SimpleTextDimensionalWriter.java:139)
   [junit4]   2>at 
org.apache.lucene.codecs.DimensionalWriter.merge(DimensionalWriter.java:45)
   [junit4]   2>at 
org.apache.lucene.index.SegmentMerger.mergeDimensionalValues(SegmentMerger.java:168)
   [junit4]   2>at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:117)
   [junit4]   2>at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4055)
   [junit4]   2>at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3635)
   [junit4]   2>at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
   [junit4]   2>at 

[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976481#comment-14976481
 ] 

Michael McCandless commented on LUCENE-6825:


Thanks [~steve_rowe], I'll dig ...

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-27 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976486#comment-14976486
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1710830 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1710830 ]

LUCENE-6825: don't NPE when trying to merge a segment that has no documents 
that indexed dimensional values

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968074#comment-14968074
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709930 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1709930 ]

LUCENE-6825: fix test bug

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968068#comment-14968068
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709928 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1709928 ]

LUCENE-6825: make sure we close open file on exception

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966540#comment-14966540
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709779 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709779 ]

LUCENE-6825: merge trunk

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966534#comment-14966534
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709778 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709778 ]

LUCENE-6825: put this back

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966572#comment-14966572
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709783 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1709783 ]

LUCENE-6825: add low-level support for block-KD trees

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966021#comment-14966021
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709705 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709705 ]

LUCENE-6825: pull out MAX_DIMS as constant; remove sop; add comment

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch, LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963427#comment-14963427
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709429 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709429 ]

LUCENE-6825: simplify the writer/reader classes, remove DEBUG outputs, fix ant 
precommit

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963975#comment-14963975
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709473 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709473 ]

LUCENE-6825: merge trunk

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962676#comment-14962676
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709330 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709330 ]

LUCENE-6825: remove all nocommits; add missing MDW.createTempOutput wrapping; 
fix double-write per dim during build

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962680#comment-14962680
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709331 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709331 ]

LUCENE-6825: merge trunk

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960362#comment-14960362
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1708913 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1708913 ]

LUCENE-6825: cutover to Directory API, fix some bugs

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960745#comment-14960745
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1709001 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1709001 ]

LUCENE-6825: remove sops, add more tests (e.g. BigInteger!); cutover to 
try-with-resources/TrackingDirectoryWrapper; remove some nocommits

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-15 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958695#comment-14958695
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1708778 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1708778 ]

LUCENE-6825: merge trunk

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953813#comment-14953813
 ] 

Michael McCandless commented on LUCENE-6825:


bq. have you considered a new module for this

Well, I think a codec format is a natural way to expose this service, since 
(like postings, doc values, etc.), it's a low-level utility that can be used 
for diverse use cases (2D and 3D spatial, numeric range filtering, binary range 
filtering so we can support IPv6, BigInteger, BigDecimal, etc.).  For it to be 
exposed as a part of the codec means it needs to be in core...


> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-09 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951562#comment-14951562
 ] 

David Smiley commented on LUCENE-6825:
--

Exciting!  Mike, have you considered a new module for this; like "bkd" or 
"multidimensional"?  Or of course the existing spatial module? I wonder what 
others think.

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946507#comment-14946507
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1707202 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1707202 ]

LUCENE-6825: make branch

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946509#comment-14946509
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1707203 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1707203 ]

LUCENE-6825: starting patch

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946510#comment-14946510
 ] 

Michael McCandless commented on LUCENE-6825:


I put the current patch on this branch: 
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6825

I'll iterate there...

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946506#comment-14946506
 ] 

Michael McCandless commented on LUCENE-6825:


Thanks [~nknize]!

bq. I'm assuming lat, lon specific naming will be refactored to more 
generalized naming?

Woops, yes: I have a nocommit to make sure I don't talk about lat/lon anymore.  
I'll fix ...

bq. In an XTree implementation (similar to BKD but with more rigorous split 
criteria) I ran into bias issues when sorting by incremental dimensions (simple 
for loop like this).

The bias only happens if we index shapes right?  I agree for indexing shapes 
things get more complex, but I was hoping to do that "later" and focus on 
making points work well for now.

bq. Maybe refactor this into a split method? It would give an opportunity to 
override for investigating improvements based on other split criteria (e.g., 
squareness, area)

Good idea, will do!

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946523#comment-14946523
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1707206 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1707206 ]

LUCENE-6825: rename classes

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946494#comment-14946494
 ] 

Michael McCandless commented on LUCENE-6825:


Thanks [~rjernst]!

bq. I see a number of debugging s.o.p, 

I'll nuke all of these once things seem to be working ... also, those 
{{bytesToInt}} in the reader/writer are temporary just for debugging the test 
case that's encoding ints into the byte[].

{quote}
Is this nocommit still relevant?

// nocommit can we get this back?
//state.docs.grow(count);
{quote}

Unfortunately, yes.  One source of efficiency for the 1D,2D,3D specialized BKDs 
now is they can "notify" the consumer of the incoming docIDs that a chunk is 
coming ... but I didn't expose any means to do this in the intersect API.  I 
could add it but held back for now since that dirties the read-time API.  
Though for the 1D case, this was a big speedup, because in that case I can give 
a pretty accurate estimate up front of how large the result set will be ... 
I'll mull, probably leave as a TODO for now.

bq. Can we avoid the bitflip in bytesToInt by using a long accumulator and 
casting to int?

Hmm I don't think that's the same thing?  We flip that big so that in binary 
space negative ints sort before positive ints?  But remember this is only a 
test issue at this point (for this issue) ... in future issues we will need 
clever "encode/decode foobar as byte[]" ...

bq. Can we limit the num dims to 3 for now?

How about limit to 15?  I think 3 is a bit too low (and 255 is a way too 
high!), but there are maybe interesting things to do w/ more than 3 dims, e.g. 
index household income along with x, y, z points or something.  So this would 
still give us 4 free bits in case we want them later...

bq. In `BKDWriter.finish()`, can the try/catch around building be simplified?

I'll try your suggestion.  I think if I change destroy to close instead, I can 
just tap into IOUtils goodness...

bq. In `BKDWriter.build()`, there is a nocommit about destroying perdim 
writers, but I think that is handled by the caller in `finish()` 

It is confusing!

Each level of the recursion may have written its own files, so I think we need 
the toplevel destroy (removes the fully sorted by each dim file we created up 
front) and the destroy as we recurse (removes the partitioned subset of each 
dim).

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as 

[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946578#comment-14946578
 ] 

ASF subversion and git services commented on LUCENE-6825:
-

Commit 1707213 from [~mikemccand] in branch 'dev/branches/lucene6825'
[ https://svn.apache.org/r1707213 ]

LUCENE-6825: fix some nocommits; remove some difficult try/finally logic

> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-06 Thread Ryan Ernst (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945626#comment-14945626
 ] 

Ryan Ernst commented on LUCENE-6825:


This looks good. I think this is a great plan, to separate this out into a 
separate codec format, and to do it in pieces. And this patch is a good start. 

A few comments on the patch:
* I see a number of debugging s.o.p, these should be commented out or removed?
* Is this nocommit still relevant?
{quote}
// nocommit can we get this back?
//state.docs.grow(count);
{quote}
* There is a nocommit on the {{bytesToInt}} method. I think as a follow up we 
should investigate packing values, but is the nocommit still needed? Also, it 
seems to exist in both reader and writer? Could it be in one place, but still 
package private, perhaps in `oal.util.bkd.Util`?
* Can we avoid the bitflip in {{bytesToInt}} by using a long accumulator and 
casting to int? we can assert no upper bits are set before casting?
* Can we limit the num dims to 3 for now? I see a check for < 255 to fit in a 
byte, but it might be nice later to use those extra bits for some other 
information (essentially lets reserve the bits for now, instead of allowing 
silly number of dimensions)
* In `BKDWriter.finish()`, can the try/catch around building be simplified? I 
think you could remove the `success` marker and do the regular cleanup after 
build, and change the `finally` to a `catch`, then add any failures when 
destroying the per dim writers as suppressed exceptions to the original?
* In `BKDWriter.build()`, there is a nocommit about destroying perdim writers, 
but I think that is handled by the caller in `finish()` mentioned in my 
previous comment? I also see some destroy calls below that...is there double 
destroying going on, or is this more complicated than it looks?


> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6825) Add multidimensional byte[] indexing support to Lucene

2015-10-06 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945944#comment-14945944
 ] 

Nicholas Knize commented on LUCENE-6825:


+1000 for graduating the data structure to core.

{{// Sort all docs once by lat, once by lon:}}  
* I'm assuming lat, lon specific naming will be refactored to more generalized 
naming?
* In an XTree implementation (similar to BKD but with more rigorous split 
criteria) I ran into bias issues when sorting by incremental dimensions (simple 
for loop like this). This is often why sort is done by a reduced dimensional 
encoding value (e.g., Hilbert, Morton). This is particularly important as the 
tree grows (which I'm guessing happens when BKD segments merge?). Maybe another 
new issue to investigate simple interleave packing and sorting on the packed 
value?

{{// Find which dim has the largest span so we can split on it:}}
* Maybe refactor this into a {{split}} method?  It would give an opportunity to 
override for investigating improvements based on other split criteria (e.g., 
squareness, area) 



> Add multidimensional byte[] indexing support to Lucene
> --
>
> Key: LUCENE-6825
> URL: https://issues.apache.org/jira/browse/LUCENE-6825
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6825.patch
>
>
> I think we should graduate the low-level block KD-tree data structure
> from sandbox into Lucene's core?
> This can be used for very fast 1D range filtering for numerics,
> removing the 8 byte (long/double) limit we have today, so e.g. we
> could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
> It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
> (x/y/z with geo3d) geo shape intersection searches.
> The idea here is to add a new part of the Codec API (DimensionalFormat
> maybe?) that can do low-level N-dim point indexing and at runtime
> exposes only an "intersect" method.
> It should give sizable performance gains (smaller index, faster
> searching) over what we have today, and even over what auto-prefix
> with efficient numeric terms would do.
> There are many steps here ... and I think adding this is analogous to
> how we added FSTs, where we first added low level data structure
> support and then gradually cutover the places that benefit from an
> FST.
> So for the first step, I'd like to just add the low-level block
> KD-tree impl into oal.util.bkd, but make a couple improvements over
> what we have now in sandbox:
>   * Use byte[] as the value not int (@rjernst's good idea!)
>   * Generalize it to arbitrary dimensions vs. specialized/forked 1D,
> 2D, 3D cases we have now
> This is already hard enough :)  After that we can build the
> DimensionalFormat on top, then cutover existing specialized block
> KD-trees.  We also need to fix OfflineSorter to use Directory API so
> we don't fill up /tmp when building a block KD-tree.
> A block KD-tree is at heart an inverted data structure, like postings,
> but is also similar to auto-prefix in that it "picks" proper
> N-dimensional "terms" (leaf blocks) to index based on how the specific
> data being indexed is distributed.  I think this is a big part of why
> it's so fast, i.e. in contrast to today where we statically slice up
> the space into the same terms regardless of the data (trie shifting,
> morton codes, geohash, hilbert curves, etc.)
> I'm marking this as trunk only for now... as we iterate we can see if
> it could maybe go back to 5.x...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org